# Hybrid Search with Reranking and Filtering
_Last updated: May 31, 2025_

This notebook demonstrates advanced hybrid search techniques combining three different [embedding](https://en.wikipedia.org/wiki/Word_embedding) approaches: **dense**, **sparse**, and **late interaction**. It uses dense and sparse embeddings for initial retrieval, then late interaction for reranking. It also filters by `user_id` to simulate multitenancy.

### What is Hybrid Search?
Hybrid search combines multiple retrieval methods to improve search quality and relevance. Instead of relying on a single approach, we leverage the strengths of different embedding techniques:

- **Dense Embeddings** (Semantic Search): Uses neural networks to create dense vector representations of unstructured data, capturing semantic meaning and context. Great for finding conceptually similar content.

- **Sparse Embeddings** (Keyword Search): Uses BM25 term weighting to find exact keyword matches. It's great for maintaining interpretability and precision, especially in use cases where industry-specific terms are used.

- **Late Interaction Embeddings** (Advanced Semantic): Gives each [token](https://en.wikipedia.org/wiki/Text_segmentation#Word_segmentation) its own embedding, enabling precise matching and fine-grained interactions between query and document tokens. It combines the benefits of dense retrieval with token-level precision, making it well suited to reranking in this workflow.

**Why?** Combining these three approaches makes retrieval more robust across the different types of queries sent to the vector database.
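
To make the late interaction idea concrete, here is a minimal sketch of MaxSim-style scoring in plain Python: each query token is matched to its most similar document token, and those best matches are summed. The 2-dimensional token vectors and the helper names (`cosine`, `max_sim`) are made up for illustration; a real ColBERT model produces one 128-dimensional vector per token.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def max_sim(query_tokens, doc_tokens):
    """Sum, over query tokens, of the best cosine match among document tokens."""
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)

# Toy token embeddings (illustrative values only).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # has a match for both query tokens
doc_b = [[1.0, 0.0], [-1.0, 0.0]]             # matches only the first query token

score_a = max_sim(query_tokens, doc_a)
score_b = max_sim(query_tokens, doc_b)
print(score_a, score_b)
```

Because the score is a sum over query tokens, it is not bounded by 1 the way a single cosine similarity is, and a document that covers more of the query's tokens ranks higher.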

### References and Resources
The following were used to complete this project:
- [Qdrant Reranking in Hybrid Search](https://qdrant.tech/documentation/advanced-tutorials/reranking-hybrid-search/?q=ingest#ingestion-stage)
- [Qdrant 'Concepts' Documentation](https://qdrant.tech/documentation/concepts/)
- [How to Build the Ultimate Hybrid Search with Qdrant (video)](https://www.youtube.com/live/LAZOxqzceEU?si=4HF34v9G1xq3Z3-6)
- [Anthropic's Claude](https://claude.ai/new) (for coding support like troubleshooting, debugging, cleanup)
- [Hugging Face Datasets](https://huggingface.co/datasets?modality=modality:text&sort=trending) (for source documents)

## Environment Setup

First we install all the required libraries for this notebook:

- **qdrant-client**: Qdrant's vector database client for storing and retrieving embeddings
- **fastembed**: Qdrant's open-source, lightweight library for generating different embedding types
- **tqdm**: Visual progress bars to track long-running operations
- **polars**: Fast dataframe library for data manipulation

We then import all the libraries and modules we need.

In [3]:
# Install required libraries
%pip install qdrant-client fastembed tqdm polars

Note: you may need to restart the kernel to use updated packages.





In [4]:
# Import dependencies
from qdrant_client import QdrantClient, models
from fastembed import TextEmbedding, LateInteractionTextEmbedding, SparseTextEmbedding 
import polars as pl
from tqdm import tqdm
from getpass import getpass
import random, os



## Create Embeddings

This section covers the creation of our three different types of embeddings. Each embedding type captures different aspects of the source text:

- **Dense embeddings**: Fixed-size vectors, mostly non-zero, capturing semantic meaning
- **Sparse embeddings**: Variable-size vectors, mostly zero, capturing keyword term-frequency information.
- **Late interaction embeddings**: Multiple vectors per document for precise relevance with contextual understanding.
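
As a rough illustration of why sparse vectors are cheap to compare, the sketch below stores each one as a mapping from token index to weight and scores a pair by summing products over the indices they share. The indices, weights, and the `sparse_dot` helper are hypothetical and are not taken from the BM25 model used in this notebook.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {index: weight} dicts."""
    # Iterate over the smaller dict; only indices present in both contribute.
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b[index] for index, weight in a.items() if index in b)

# Hypothetical token indices and weights, for illustration only.
query_vec = {101: 1.0, 205: 1.0}
doc_vec = {101: 1.5, 300: 0.7, 412: 2.1}

score = sparse_dot(query_vec, doc_vec)
print(score)
```

Only index `101` appears in both vectors, so it is the only term that contributes to the score. This is what makes keyword-style matching easy to interpret.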

### Embedding Setup

We must first initialize our embedding models; then we can load the document dataset and start embedding. The source documents for this project are **arXiv paper abstracts**, found on [Hugging Face](https://huggingface.co/datasets/bluuebunny/arxiv_abstract_embedding_mxbai_large_v1_milvus_binary) and published by Mitanshu Sukhwani in 2025. This dataset provides a rich corpus of scientific text for demonstrating text retrieval. We randomly sample 1,000 abstracts from this dataset to keep embedding time and storage manageable.

In [5]:
# Initialize embedding models
dense_embedding_model = TextEmbedding("sentence-transformers/all-MiniLM-L6-v2")
bm25_embedding_model = SparseTextEmbedding("Qdrant/bm25")
late_interaction_embedding_model = LateInteractionTextEmbedding("colbert-ir/colbertv2.0")

In [6]:
# Load the dataset and randomly sample 1,000 documents (abstracts)
documents = pl.read_parquet('hf://datasets/bluuebunny/arxiv_abstract_embedding_mxbai_large_v1_milvus_binary/**/*.parquet')
documents = documents['abstract'].sample(1000)
print(documents)
print(f"Total documents loaded: {len(documents)}")

shape: (1_000,)
Series: 'abstract' [str]
[
	"  We present a combined experi…
	"  In a companion paper we have…
	"  A recent work has shown that…
	"  Thermoelectric (TE) conversi…
	"  The emergence of flat bands …
	…
	"  A nonequilibrium Green's fun…
	"  In this paper, we propose a …
	"  The ALICE experiment at LHC …
	"  We generalize the scattering…
	"  Aims. We tested the new atom…
]
Total documents loaded: 1000


### Generate Actual Embeddings

Now that our data and models are prepared, we're ready to start generating embeddings. **This is the most compute-intensive task in the workflow, and may take some time to complete!**

**Time required on CPU**
- Dense Embeddings: ~20 seconds per 1000 docs
- Sparse Embeddings: ~1 second per 1000 docs
- Late Interaction Embeddings: ~3-4 minutes per 1000 docs (token-level embeddings)
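
These per-1000 figures can be extrapolated to get a feel for a larger corpus, for example 1 million documents. The sketch below assumes embedding time scales linearly with document count and uses the midpoint (210 seconds) of the late interaction range. Both are simplifying assumptions, so treat the result as a rough guide rather than a benchmark.

```python
# Approximate CPU seconds per 1,000 documents, taken from the list above.
# Late interaction uses the midpoint of the 3-4 minute range (an assumption).
SECONDS_PER_1000 = {"dense": 20, "sparse": 1, "late_interaction": 210}

def estimate_seconds(num_docs):
    """Linearly extrapolate embedding time (in seconds) for num_docs documents."""
    return {name: secs * num_docs / 1000 for name, secs in SECONDS_PER_1000.items()}

estimate = estimate_seconds(1_000_000)
total_hours = sum(estimate.values()) / 3600
print(f"Estimated total: {total_hours:.1f} hours")
```

Under these assumptions, the late interaction model accounts for almost all of the total, which is why it is the first candidate for GPU acceleration or parallelization.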

In [7]:
# Generate embeddings for all documents
print("Generating embeddings...")

dense_embeddings = []
for doc in tqdm(documents, desc="Dense Embeddings"):
    embedding = next(dense_embedding_model.embed(doc))
    dense_embeddings.append(embedding)

bm25_embeddings = []
for doc in tqdm(documents, desc="BM25 Embeddings"):
    embedding = next(bm25_embedding_model.embed(doc))
    bm25_embeddings.append(embedding)

late_interaction_embeddings = []
for doc in tqdm(documents, desc="Late Interaction Embeddings"):
    embedding = next(late_interaction_embedding_model.embed(doc))
    late_interaction_embeddings.append(embedding)

Generating embeddings...


Dense Embeddings: 100%|██████████| 1000/1000 [00:19<00:00, 50.64it/s]
BM25 Embeddings: 100%|██████████| 1000/1000 [00:00<00:00, 3425.00it/s]
Late Interaction Embeddings: 100%|██████████| 1000/1000 [03:24<00:00,  4.89it/s]


In [8]:
# Check shapes and types
print(f"Dense embedding shape: {dense_embeddings[0].shape}")
print(f"BM25 embedding type: {type(bm25_embeddings[0])}")
print(f"Late interaction embedding shape: {late_interaction_embeddings[0].shape}")

Dense embedding shape: (384,)
BM25 embedding type: <class 'fastembed.sparse.sparse_embedding_base.SparseEmbedding'>
Late interaction embedding shape: (318, 128)


## Using Qdrant Cloud Vector Database

Qdrant is a high-performance vector database optimized for similarity search. We're using it because it offers:

- Multi-vector support
- Hybrid search
- Filtering
- Scalability
- Performance

(and of course, because it's required for the project!)

### Setting up Qdrant

We are using Qdrant Cloud for this exercise, which requires an endpoint and API key to access the cluster. The code below reads both from environment variables and otherwise prompts the user for them (hardcoding credentials would be a security risk), and then creates a [collection](https://qdrant.tech/documentation/concepts/collections/). A collection is a named set of points (vectors with a payload) among which you can search. A [tenant index](https://qdrant.tech/documentation/guides/multiple-partitions/#configure-multitenancy) is also created to support filtering by `user_id`.

In [9]:
# Configure Qdrant endpoint and API key
QDRANT_ENDPOINT = (
    os.environ["QDRANT_ENDPOINT"]
    if "QDRANT_ENDPOINT" in os.environ
    else input("Qdrant endpoint: ")
)
QDRANT_API_KEY = (
    os.environ["QDRANT_API_KEY"]
    if "QDRANT_API_KEY" in os.environ
    else getpass("Qdrant API key: ")
)

COLLECTION_NAME = "hybrid-search"

# Make connection
client = QdrantClient(
    url=QDRANT_ENDPOINT,
    api_key=QDRANT_API_KEY
)

In [10]:
# Delete existing collection if it exists
try:
    client.delete_collection(COLLECTION_NAME)
    print(f"Deleted existing collection: {COLLECTION_NAME}")
except Exception:
    # Best-effort cleanup: ignore failures and continue to collection creation
    pass

# Create collection
client.create_collection(
    COLLECTION_NAME,
    vectors_config={
        "dense": models.VectorParams(
            size=384,  # all-MiniLM-L6-v2 embedding size
            distance=models.Distance.COSINE,
            quantization_config=models.BinaryQuantization(
                binary=models.BinaryQuantizationConfig(
                    always_ram=True
                )
            )
        ),
        "colbert": models.VectorParams(
            size=128,  # ColBERT embedding dimension
            distance=models.Distance.COSINE,
            multivector_config=models.MultiVectorConfig(
                comparator=models.MultiVectorComparator.MAX_SIM,
            )
        ),
    },
    sparse_vectors_config={
        "bm25": models.SparseVectorParams(modifier=models.Modifier.IDF)
    }
)
print(f"Created new collection: {COLLECTION_NAME}")

Deleted existing collection: hybrid-search
Created new collection: hybrid-search


In [11]:
# Create user_id index for filtering
client.create_payload_index(
    collection_name=COLLECTION_NAME,
    field_name="user_id",
    field_schema=models.KeywordIndexParams(
        type=models.KeywordIndexType.KEYWORD,
        is_tenant=True,
    ),
)

UpdateResult(operation_id=1, status=<UpdateStatus.COMPLETED: 'completed'>)

### Point Creation
Each "point" in Qdrant represents a document with all its associated vectors and metadata:

#### Point Structure:
- **ID**: Unique identifier for the document  
- **Vectors**: All three embedding types stored together  
- **Payload**: Document text and metadata (including user_id for filtering)  

#### Simulated Multi-User Environment:
We're randomly assigning user IDs (`user_1` through `user_10`) to simulate a multi-tenant application where users should only see their own documents.
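
If you want the same tenant assignment on every run, one option is a seeded random generator. This is a small sketch and not part of the original workflow; the seed value `42` and the variable names are arbitrary choices.

```python
import random
from collections import Counter

# A dedicated, seeded generator makes the assignment repeatable across runs.
rng = random.Random(42)  # arbitrary seed
user_ids = [f"user_{rng.randint(1, 10)}" for _ in range(1000)]

# Check how evenly documents are spread across the ten simulated tenants.
counts = Counter(user_ids)
print(sorted(counts.items()))
```

With 1,000 documents and ten users, each tenant should end up with roughly 100 documents, which is worth keeping in mind when choosing prefetch and result limits.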

#### Vector Format Conversion:
- **Dense**: Convert numpy array to list  
- **BM25**: Use `.as_object()` method for sparse format  
- **Late Interaction**: Convert 2D numpy array to nested list  

This structure enables both vector similarity search and metadata filtering in a single query.


In [12]:
from qdrant_client.models import PointStruct

# Point creation
points = []
for idx, (dense_embedding, bm25_embedding, late_interaction_embedding, doc) in enumerate(
    zip(dense_embeddings, bm25_embeddings, late_interaction_embeddings, documents)
):
    # Generate a random user_id for demo
    user_id = f"user_{random.randint(1, 10)}"
    
    point = PointStruct(
        id=idx,
        vector={
            "dense": dense_embedding.tolist(),
            "bm25": bm25_embedding.as_object(),
            "colbert": late_interaction_embedding.tolist(),
        },
        payload={
            "document": doc,
            "user_id": user_id
        }
    )
    points.append(point)

### Ingesting Data with Qdrant

Here, we send the data to our Qdrant vector database cluster in batches, using a `batch_size` chosen to balance request size against the number of round trips.

In [13]:
# Batch upsert for better performance
batch_size = 25
for i in tqdm(range(0, len(points), batch_size), desc="Uploading to Qdrant"):
    batch = points[i:i + batch_size]
    client.upsert(collection_name=COLLECTION_NAME, points=batch)

Uploading to Qdrant: 100%|██████████| 40/40 [02:14<00:00,  3.37s/it]
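
A back-of-the-envelope calculation shows why a small batch size is reasonable here. The sketch below counts the float values in one point's dense and ColBERT vectors, using the `(318, 128)` shape printed earlier as a representative token count. Actual token counts vary per abstract, and the helper name `floats_per_point` is made up for this example.

```python
import math

def floats_per_point(num_tokens, colbert_dim=128, dense_dim=384):
    """Count the float values in one point's dense and ColBERT vectors."""
    return num_tokens * colbert_dim + dense_dim

# 318 tokens matches the single example shape printed earlier; other abstracts differ.
per_point = floats_per_point(318)
per_batch = per_point * 25            # batch_size used above
num_batches = math.ceil(1000 / 25)    # 1,000 points in batches of 25

print(per_point, per_batch, num_batches)
```

At this size, a single batch of 25 points already carries on the order of a million floats, almost all of them from the token-level vectors.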


### Retrieve Vectors from Qdrant

Now for the fun stuff! Let's see how well we can retrieve content from Qdrant using a query.

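The query names a specific topic (galaxies) but asks an open-ended, subjective question, so answering it depends on semantic meaning rather than exact keywords. This is something a traditional database would not be able to handle effectively. The results are filtered for the hypothetical `user_3`.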

#### Query Strategy:
1. **Prefetch** with dense and sparse embeddings to get candidate documents  
2. **Rerank** using late interaction embeddings for final precision  
3. **Filter** results by user ownership for multi-tenant security  

This approach combines the strengths of all three methods while maintaining security boundaries.
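
The three steps above can be sketched with plain Python data structures. This is a conceptual, in-memory stand-in and not how Qdrant executes the query internally; the document IDs, owners, scores, and the `hybrid_search` function are all hypothetical.

```python
def hybrid_search(dense_hits, sparse_hits, rerank_scores, owners, user_id, limit):
    """Toy pipeline: merge candidates, keep one tenant's docs, rerank, truncate."""
    # Prefetch: union of the candidates found by each retriever.
    candidates = set(dense_hits) | set(sparse_hits)
    # Filter: keep only documents owned by the requesting user.
    allowed = [doc for doc in candidates if owners[doc] == user_id]
    # Rerank: order the remaining candidates by the reranker's score.
    ranked = sorted(allowed, key=lambda doc: rerank_scores[doc], reverse=True)
    return ranked[:limit]

# Hypothetical data for illustration.
owners = {"a": "user_3", "b": "user_1", "c": "user_3", "d": "user_3"}
dense_hits = ["a", "b"]
sparse_hits = ["b", "c", "d"]
rerank_scores = {"a": 0.2, "b": 0.9, "c": 0.7, "d": 0.5}

top = hybrid_search(dense_hits, sparse_hits, rerank_scores, owners, "user_3", limit=2)
print(top)
```

Document `b` has the highest reranker score but belongs to another user, so it never reaches the results. Reranking only decides the order among documents the user is allowed to see.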


In [14]:
# Enter query and user_id filter
query = "What are the most interesting galaxies in the universe?"
target_user_id = "user_3"

The query itself must be converted to an embedding so that Approximate Nearest Neighbor (ANN) search can find the most similar content.
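
For intuition, the exact search that ANN approximates can be written as a brute-force scan: score every stored vector against the query and keep the top `k`. The sketch below does this with cosine similarity on made-up 2-dimensional vectors; the labels and the `nearest` helper are illustrative only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest(query, vectors, k):
    """Exact top-k search: compare the query with every stored vector."""
    ranked = sorted(vectors, key=lambda name: cosine(query, vectors[name]), reverse=True)
    return ranked[:k]

# Hypothetical document vectors, for illustration only.
vectors = {
    "galaxies": [0.9, 0.1],
    "proteins": [0.1, 0.9],
    "stars": [0.7, 0.3],
}
top_two = nearest([1.0, 0.0], vectors, k=2)
print(top_two)
```

A brute-force scan like this costs one comparison per stored vector, which is why vector databases use approximate indexes to avoid visiting every point.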

In [15]:
# Embed the query into vector space
dense_vectors = next(dense_embedding_model.query_embed(query))
sparse_vectors = next(bm25_embedding_model.query_embed(query))
late_vectors = next(late_interaction_embedding_model.query_embed(query))

In [16]:
# Create prefetch that will find candidate documents from multiple vector types
prefetch = [
        models.Prefetch(
            query=dense_vectors,
            using="dense",
            limit=50,
        ),
        models.Prefetch(
            query=models.SparseVector(**sparse_vectors.as_object()),
            using="bm25",
            limit=50,
        ),
    ]

In [17]:
# Create user_id filter using the indexed field
user_filter = models.Filter(
    must=[
        models.FieldCondition(
            key="user_id",
            match=models.MatchValue(value=target_user_id)
        )
    ]
)

In [18]:
# Run final query
results = client.query_points(
    COLLECTION_NAME,
    prefetch=prefetch,
    query=late_vectors,
    using="colbert",
    with_payload=True,
    limit=50,
    query_filter=user_filter
)

results

QueryResponse(points=[ScoredPoint(id=702, version=30, score=16.326674, payload={'document': '  High redshift galaxy clusters have traditionally been a fruitful place to study galaxy evolution. I review various search strategies for finding clusters at z > 1. Most efforts to date have concentrated on the environments of distant AGN. I illustrate these with data on the cluster around 3C 324 (z=1.2) and other, more distant systems, and discuss possibilities for future surveys with large telescopes. ', 'user_id': 'user_3'}, vector=None, shard_key=None, order_value=None), ScoredPoint(id=726, version=31, score=14.451191, payload={'document': '  Using a subsample of 79 nearby clusters from the RASS-SDSS galaxy cluster catalogue of Popesso et al. (2005a), we perform a regression analysis between the cluster integrated star formation rate (Sigma_SFR) the cluster total stellar mass (M_star), the fractions of star forming (f_SF) and blue (f_b) galaxies and other cluster global properties, namely 

### Display Results

To make the results easier to read, I've collected them into a Polars DataFrame and printed it as a table.

In [19]:
output = []
for point in results.points:
    output.append({
        'id': point.id,
        'score': point.score,
        'user_id': point.payload.get('user_id', ''),
        'payload': point.payload.get('document', '')
    })

df = pl.DataFrame(output)
pl.Config.set_fmt_str_lengths(200) # Show up to 200 characters from abstract
print(df)


shape: (50, 4)
┌─────┬───────────┬─────────┬──────────────────────────────────────────────────────────────────────┐
│ id  ┆ score     ┆ user_id ┆ payload                                                              │
│ --- ┆ ---       ┆ ---     ┆ ---                                                                  │
│ i64 ┆ f64       ┆ str     ┆ str                                                                  │
╞═════╪═══════════╪═════════╪══════════════════════════════════════════════════════════════════════╡
│ 702 ┆ 16.326674 ┆ user_3  ┆ High redshift galaxy clusters have traditionally been a fruitful     │
│     ┆           ┆         ┆ place to study galaxy evolution. I review various search strategies  │
│     ┆           ┆         ┆ for finding clusters at z > 1. Most efforts to date have concentr…   │
│ 726 ┆ 14.451191 ┆ user_3  ┆ Using a subsample of 79 nearby clusters from the RASS-SDSS galaxy    │
│     ┆           ┆         ┆ cluster catalogue of Popesso et al. (2005a), w

## Conclusion and Takeaways

### What We've Accomplished

**Multiple Embedding Types**  
- Dense embeddings for semantic understanding  
- Sparse embeddings for keyword precision  
- Late interaction embeddings for fine-grained relevance  

**Production-Ready Architecture**  
- Scalable vector database with Qdrant  
- Efficient batch processing and indexing  
- Multi-tenant security with user filtering  

**Advanced Search Capabilities**  
- Hybrid retrieval combining multiple signals  
- Sophisticated ranking and reranking  
- Flexible query processing pipeline  

### Potential Enhancements

- Implement query-time filtering for better performance  
- Add caching for frequently accessed embeddings
- Add GPU acceleration to embedding generation 
- Parallelize long-running processes for faster execution 
- Integrate with LLMs for a more conversational experience (RAG) 

### Key Takeaways

1. **Hybrid approaches can outperform single methods** by combining different strengths  
2. **Late interaction models** provide exceptional precision for text search  
3. **Vector databases** enable sophisticated multi-vector search at scale  
4. **Security considerations** are crucial for multi-tenant applications  

This hybrid search system provides a solid foundation for building production-grade search applications that deliver both high recall and precision across diverse query types.

Thanks for the fun project!
