# Hybrid Search with Reranking and Filtering
_Last updated: June 1, 2025_

This notebook demonstrates advanced hybrid search techniques combining three different [embedding](https://en.wikipedia.org/wiki/Word_embedding) approaches: **dense**, **sparse**, and **late interaction**. It uses dense and sparse embeddings for initial retrieval, then late interaction for reranking. It also filters by `user_id` to simulate multitenancy.

### What is Hybrid Search?
Hybrid search combines multiple retrieval methods to improve search quality and relevance. Instead of relying on a single approach, we leverage the strengths of different embedding techniques:

- **Dense Embeddings** (Semantic Search): Uses neural networks to create dense vector representations of unstructured data, capturing semantic meaning and context. Great for finding conceptually similar content.

- **Sparse Embeddings** (Keyword Search): Uses BM25 keyword scoring to find exact term matches. Great for interpretability and precision, especially in use cases that rely on industry-specific terms.

- **Late Interaction Embeddings** (Advanced Semantic): Allows each [token](https://en.wikipedia.org/wiki/Text_segmentation#Word_segmentation) to get its own embedding, enabling precise matching and fine-grained interactions. It combines the benefits of dense retrieval and token-level precision, making it great for reranking in this workflow.

**Why?** Combining these three approaches makes retrieval more robust across the different types of queries sent to the vector database.
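
To make the late interaction idea more concrete, here is a minimal pure-Python sketch of "MaxSim" scoring, the style of scoring used by ColBERT-like models. The token vectors are hand-made toy values and the function names are hypothetical; a real model would produce the vectors with a neural network.

```python
# Toy late-interaction (MaxSim) scoring with hand-made 2-d token vectors.
# All vectors and names here are illustrative assumptions, not model output.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_tokens, doc_tokens):
    """For each query token, take its best-matching document token, then sum."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

query = [[1.0, 0.0], [0.0, 1.0]]               # two query tokens
doc_a = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]   # has a close match for both query tokens
doc_b = [[0.9, 0.1], [0.8, 0.2]]               # has a close match for the first token only

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Because every query token is matched separately, `doc_a` scores higher than `doc_b`, even though both contain a strong match for the first query token.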

### References and Resources
The following were used to complete this project:
- [Qdrant Reranking in Hybrid Search](https://qdrant.tech/documentation/advanced-tutorials/reranking-hybrid-search/?q=ingest#ingestion-stage)
- [Qdrant 'Concepts' Documentation](https://qdrant.tech/documentation/concepts/)
- [How to Build the Ultimate Hybrid Search with Qdrant (video)](https://www.youtube.com/live/LAZOxqzceEU?si=4HF34v9G1xq3Z3-6)
- [Anthropic's Claude](https://claude.ai/new) (for coding support like troubleshooting, debugging, cleanup)
- [Hugging Face Datasets](https://huggingface.co/datasets?modality=modality:text&sort=trending) (for source documents)

## Environment Setup

First we install all the required libraries for this notebook:

- **qdrant-client**: Qdrant's vector database client for storing and retrieving embeddings
- **fastembed**: Qdrant's open-source, lightweight, and comprehensive library for generating different embedding types
- **fastembed-gpu**: The version of fastembed that utilizes GPU acceleration (when available)
- **tqdm**: Visual progress bars to track long-running operations
- **polars**: Fast dataframe library for data manipulation
- **torch**: PyTorch for GPU acceleration (when available)
- **rerankers**: Comprehensive library of rerankers, including ColBERT

We then import all the libraries and modules we need.

In [1]:
# Install required libraries
%pip install qdrant-client fastembed fastembed-gpu tqdm polars torch rerankers



In [2]:
# Import dependencies
from qdrant_client import QdrantClient, models
from fastembed import TextEmbedding, LateInteractionTextEmbedding, SparseTextEmbedding
import polars as pl
from tqdm import tqdm
from getpass import getpass
import random, os
import torch
from rerankers import Reranker

## Create Embeddings

This section covers the creation of two different types of embeddings needed for hybrid search. Each embedding type captures different aspects of the source text:

- **Dense embeddings**: Fixed-size vectors, mostly non-zero, capturing semantic meaning
- **Sparse embeddings**: Variable-size vectors, mostly zero, capturing keyword term-frequency information.
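
As a rough illustration of the sparse side, the sketch below stores a document as index/weight pairs and scores it against a query with a sparse dot product. The vocabulary, the raw term counts, and the function names are toy assumptions; this is not how the BM25 model in this notebook computes its weights.

```python
from collections import Counter

# Hypothetical toy vocabulary (term -> index); a real sparse model builds this mapping itself.
VOCAB = {"galaxy": 0, "cluster": 1, "dwarf": 2, "protein": 3, "folding": 4}

def to_sparse(text):
    """Return {index: weight} holding only the vocabulary terms that occur in the text."""
    counts = Counter(token for token in text.lower().split() if token in VOCAB)
    return {VOCAB[token]: float(count) for token, count in counts.items()}

def sparse_dot(a, b):
    """Dot product that visits only the non-zero positions of `a`."""
    return sum(weight * b.get(index, 0.0) for index, weight in a.items())

doc_vector = to_sparse("dwarf galaxy near a galaxy cluster")
query_vector = to_sparse("galaxy cluster")
score = sparse_dot(query_vector, doc_vector)

print(doc_vector)
print(score)
```

Only three of the five vocabulary positions are stored for the document, which is why sparse vectors stay small even when the vocabulary is very large.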

### Embedding Setup

We must first initialize our embedding models; then we can load our document dataset and start embedding. The source documents for this project are **arXiv paper abstracts**, found on [Hugging Face](https://huggingface.co/datasets/bluuebunny/arxiv_abstract_embedding_mxbai_large_v1_milvus_binary) and published by Mitanshu Sukhwani in 2025. This dataset provides a rich corpus of scientific text for demonstrating text retrieval. We randomly sample 10,000 abstracts from this dataset to keep the number of embeddings to the minimum required volume.

In [3]:
# Initialize embedding models
dense_embedding_model = TextEmbedding("sentence-transformers/all-MiniLM-L6-v2")
bm25_embedding_model = SparseTextEmbedding("Qdrant/bm25")



In [4]:
# Load the dataset and randomly sample 10,000 documents (arXiv abstracts)
documents = pl.read_parquet('hf://datasets/bluuebunny/arxiv_abstract_embedding_mxbai_large_v1_milvus_binary/**/*.parquet')
documents = documents['abstract'].sample(10000)
print(documents)
print(f"Total documents loaded: {len(documents)}")

shape: (10_000,)
Series: 'abstract' [str]
[
	"  Coulomb collisions in partic…
	"  Reflected light curves obser…
	"  Our goals are (i) to search …
	"  To search for possible textu…
	"  Hartman and Nissim-Sabat hav…
	…
	"  With the fast development of…
	"  We propose a new method in w…
	"  We study the transverse self…
	"  The integer division of a nu…
	"  Here we present an algorithm…
]
Total documents loaded: 10000


### Generate Actual Embeddings

Now that our data and models are prepared, we're ready to start generating embeddings. **This is the most compute-intensive task in the workflow, and may take some time to complete!**

**Time required on CPU**
- Dense Embeddings: ~20 seconds per 1000 docs
- Sparse Embeddings: ~1 second per 1000 docs

**GPU Usage**

Before generating embeddings, we check for GPU availability and set appropriate batch sizes. GPU acceleration significantly speeds up embedding generation, especially for dense embeddings.

In [5]:
# Check GPU availability
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")
if torch.cuda.is_available():
    print(f"Current GPU: {torch.cuda.get_device_name()}")

CUDA available: False
GPU count: 0


In [6]:
# Set batch size based on GPU availability
if torch.cuda.is_available():
    batch_size = 200
else:
    batch_size = 32

print(f"Batch size: {batch_size}")

Batch size: 32


In [7]:
# Generate embeddings for all documents; use batching to optimize performance
print("Generating embeddings...")

def generate_embeddings(model, documents, batch_size=batch_size, desc="Embeddings"):
    embeddings = []

    with tqdm(total=len(documents), desc=desc, unit="doc") as pbar:
        for i in range(0, len(documents), batch_size):
            batch = documents[i:i + batch_size]
            batch_embeddings = list(model.embed(batch))
            embeddings.extend(batch_embeddings)
            pbar.update(len(batch))

    return embeddings

dense_embeddings = generate_embeddings(
    dense_embedding_model, documents, batch_size=batch_size, desc="Dense Embeddings"
)

bm25_embeddings = generate_embeddings(
    bm25_embedding_model, documents, batch_size=batch_size, desc="Sparse Embeddings"
)

Generating embeddings...


Dense Embeddings: 100%|██████████| 10000/10000 [02:17<00:00, 72.92doc/s]
Sparse Embeddings: 100%|██████████| 10000/10000 [00:02<00:00, 4104.61doc/s]


In [8]:
# Check shapes and types
print(f"Dense embedding shape: {dense_embeddings[0].shape}")
print(f"BM25 embedding type: {type(bm25_embeddings[0])}")

Dense embedding shape: (384,)
BM25 embedding type: <class 'fastembed.sparse.sparse_embedding_base.SparseEmbedding'>


## Using Qdrant Cloud Vector Database

Qdrant is a high-performance vector database optimized for similarity search. We're using it because it offers:

- Multi-vector support
- Hybrid search
- Filtering
- Scalability
- Performance
- Cloud hosting

(and of course, because it's required for the project!)

### Setting up Qdrant

We are using Qdrant Cloud for this exercise, which requires an endpoint and API key to access the cluster. The code below reads both from environment variables if they are set and otherwise prompts the user for them (hardcoding credentials would be a security risk), and then creates a [collection](https://qdrant.tech/documentation/concepts/collections/). A collection is a named set of points (vectors with a payload) among which you can search. Binary quantization is used to reduce memory usage while maintaining search quality. Lastly, a [tenant index](https://qdrant.tech/documentation/guides/multiple-partitions/#configure-multitenancy) is created to allow filtering by `user_id`.
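
For intuition about the memory savings, here is a minimal sketch of the general idea behind binary quantization: keep one bit per dimension and compare vectors by counting differing bits. It assumes simple sign thresholding and uses made-up 4-dimensional vectors; the exact encoding and any rescoring that Qdrant performs internally may differ.

```python
# Toy binary quantization: one bit per dimension (assumes sign thresholding).

def binarize(vector):
    """Keep one bit per dimension: 1 if the value is positive, else 0."""
    return [1 if x > 0 else 0 for x in vector]

def hamming(a, b):
    """Number of positions where two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))

# Made-up vectors for illustration only
query = [0.12, -0.40, 0.33, -0.05]
similar = [0.10, -0.35, 0.29, 0.02]
other = [-0.20, 0.41, -0.30, 0.07]

q_bits = binarize(query)
dist_similar = hamming(q_bits, binarize(similar))
dist_other = hamming(q_bits, binarize(other))
print(dist_similar, dist_other)

# Storage for one 384-dimensional vector (the dense size used in this notebook)
float32_bytes = 384 * 4   # four bytes per float32 value
binary_bytes = 384 // 8   # one bit per value
print(float32_bytes // binary_bytes)
```

The similar vector stays close to the query after quantization, while the per-vector footprint shrinks by a factor of 32 compared with float32 storage.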

In [9]:
# Configure Qdrant endpoint and API key
QDRANT_ENDPOINT = (
    os.environ["QDRANT_ENDPOINT"]
    if "QDRANT_ENDPOINT" in os.environ
    else input("Qdrant endpoint: ")
)
QDRANT_API_KEY = (
    os.environ["QDRANT_API_KEY"]
    if "QDRANT_API_KEY" in os.environ
    else getpass("Qdrant API key: ")
)

COLLECTION_NAME = "hybrid-search"

# Make connection
client = QdrantClient(
    url=QDRANT_ENDPOINT,
    api_key=QDRANT_API_KEY
)

Qdrant endpoint: https://16325403-8fa2-40e5-8236-03ad0b059833.us-west-2-0.aws.cloud.qdrant.io
Qdrant API key: ··········


In [10]:
# Delete existing collection if it exists
try:
    client.delete_collection(COLLECTION_NAME)
    print(f"Deleted existing collection: {COLLECTION_NAME}")
except Exception:
    pass  # Ignore failures, e.g. when the collection does not exist yet

# Create collection
client.create_collection(
    COLLECTION_NAME,
    vectors_config={
        "dense": models.VectorParams(
            size=384,
            distance=models.Distance.COSINE,
            quantization_config=models.BinaryQuantization(
                binary=models.BinaryQuantizationConfig(
                    always_ram=True
                )
            )
        ),
    },
    sparse_vectors_config={
        "bm25": models.SparseVectorParams(modifier=models.Modifier.IDF)
    }
)
print(f"Created new collection: {COLLECTION_NAME}")

Deleted existing collection: hybrid-search
Created new collection: hybrid-search


In [11]:
# Create user_id index for filtering
client.create_payload_index(
    collection_name=COLLECTION_NAME,
    field_name="user_id",
    field_schema=models.KeywordIndexParams(
        type=models.KeywordIndexType.KEYWORD,
        is_tenant=True,
    ),
)

UpdateResult(operation_id=1, status=<UpdateStatus.COMPLETED: 'completed'>)

### Point Creation
Each "point" in Qdrant represents a document with all its associated vectors and metadata:

#### Point Structure:
- **ID**: Unique identifier for the document  
- **Vectors**: Both embedding types stored together  
- **Payload**: Document text and metadata (including user_id for filtering)  

#### Simulated Multi-User Environment:
We're randomly assigning user IDs (`user_1` through `user_10`) to simulate a multi-tenant application where users should only see their own documents.

**This structure enables both vector similarity search and metadata filtering in a single query.**
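
The effect of combining similarity search with a payload filter can be sketched in a few lines of plain Python. The points, vectors, and the `filtered_search` helper below are hypothetical stand-ins for illustration; Qdrant performs the equivalent work server-side with its own indexes.

```python
# Toy points with the same shape as above: id, vector, and a payload with user_id.
toy_points = [
    {"id": 0, "vector": [0.9, 0.1], "payload": {"user_id": "user_3"}},
    {"id": 1, "vector": [0.95, 0.05], "payload": {"user_id": "user_7"}},
    {"id": 2, "vector": [0.2, 0.8], "payload": {"user_id": "user_3"}},
    {"id": 3, "vector": [0.7, 0.3], "payload": {"user_id": "user_3"}},
]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def filtered_search(points, query_vector, user_id, limit=2):
    """Keep only the tenant's points, then rank them by similarity to the query."""
    allowed = [p for p in points if p["payload"]["user_id"] == user_id]
    ranked = sorted(allowed, key=lambda p: dot(query_vector, p["vector"]), reverse=True)
    return [p["id"] for p in ranked[:limit]]

result_ids = filtered_search(toy_points, [1.0, 0.0], "user_3")
print(result_ids)
```

Point 1 is the closest match to the query overall, but it belongs to `user_7`, so it never appears in the results for `user_3`.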


In [12]:
from qdrant_client.models import PointStruct

# Point creation
points = []
for idx, (dense_embedding, bm25_embedding, doc) in enumerate(
    zip(dense_embeddings, bm25_embeddings, documents)
):
    # Generate a random user_id for demo
    user_id = f"user_{random.randint(1, 10)}"

    point = PointStruct(
        id=idx,
        vector={
            "dense": dense_embedding.tolist(),
            "bm25": bm25_embedding.as_object(),
        },
        payload={
            "document": doc,
            "user_id": user_id
        }
    )
    points.append(point)

### Ingesting Data with Qdrant

Here, we send the data to our Qdrant vector database cluster using a memory-and-network balanced `batch_size` for performance optimization.

In [13]:
# Batch upsert for better performance
batch_size = 25
for i in tqdm(range(0, len(points), batch_size), desc="Uploading to Qdrant"):
    batch = points[i:i + batch_size]
    client.upsert(collection_name=COLLECTION_NAME, points=batch)

Uploading to Qdrant: 100%|██████████| 400/400 [00:36<00:00, 10.97it/s]


### Retrieve Vectors from Qdrant

Now for the fun stuff! Let's see how well we can retrieve content from Qdrant using a query.

The query is phrased in natural language and mixes a specific keyword ("galaxies") with a broad, subjective intent ("most interesting"). This is something a traditional keyword-only database would not be able to handle effectively. The results are filtered for the hypothetical `user_3`.

#### Query Strategy:
1. Convert the query into both dense and sparse embeddings
2. Use these embeddings to find candidate documents
3. Apply user filtering for security
4. Rerank results for optimal relevance  

This approach combines the strengths of all three embedding methods while maintaining security boundaries.
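
Step 2 produces two ranked candidate lists, one per embedding type, that need to be merged. One common way to merge ranked lists is reciprocal rank fusion (RRF), sketched below in plain Python. RRF is an assumption here, shown only to illustrate the merging idea; the query in this notebook may score its merged candidates differently. The document ids and the constant `k=60` are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked id lists: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical candidate ids, best match first
dense_hits = [1, 2, 3, 4]
sparse_hits = [2, 5, 1, 6]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused[:3])
```

Documents that appear near the top of both lists (ids 2 and 1 here) rise above documents that only one retriever found.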


In [14]:
# Enter query and user_id filter
query = "What are the most interesting galaxies in the universe?"
target_user_id = "user_3"

The query itself must be converted to an embedding so that Approximate Nearest Neighbor (ANN) search can find the most similar content.

In [15]:
# Embed the query into vector space
dense_vectors = next(dense_embedding_model.query_embed(query))
sparse_vectors = next(bm25_embedding_model.query_embed(query))

In [16]:
# Create prefetch that will find candidate documents from hybrid search
prefetch = [
        models.Prefetch(
            query=dense_vectors,
            using="dense",
            limit=50,
        ),
        models.Prefetch(
            query=models.SparseVector(**sparse_vectors.as_object()),
            using="bm25",
            limit=50,
        ),
    ]

In [17]:
# Create user_id filter using the indexed field
user_filter = models.Filter(
    must=[
        models.FieldCondition(
            key="user_id",
            match=models.MatchValue(value=target_user_id)
        )
    ]
)

In [18]:
# Run hybrid search to get candidates
candidates = client.query_points(
    COLLECTION_NAME,
    prefetch=prefetch,
    query=dense_vectors,
    using="dense",
    with_payload=True,
    limit=500,  # Get decent volume of candidates for reranking
    query_filter=user_filter
)

print(f"Retrieved {len(candidates.points)} candidates from hybrid search")

Retrieved 71 candidates from hybrid search


## Reranking
The initial hybrid search retrieval casts a wide net quickly, then reranking applies sophisticated scoring to improve the final results. It's faster than running expensive models on our entire corpus, but more accurate than relying only on simple similarity.

Here, we first extract candidate documents (with metadata) from our hybrid search results. Then we rerank them with the ColBERT reranker from the `rerankers` library.

**Note:** Not all rerankers create embeddings, but the ColBERT reranker used here does (at the token level rather than the document level).

In [19]:
# Extract documents and their metadata for reranking
candidate_docs = []
candidate_metadata = []

for point in candidates.points:
    candidate_docs.append(point.payload.get('document', ''))
    candidate_metadata.append({
        'id': point.id,
        'score': point.score,
        'user_id': point.payload.get('user_id', ''),
        'document': point.payload.get('document', '')
    })

In [20]:
# Rerank the candidate documents from hybrid search
print("Reranking with ColBERT...")
ranker = Reranker('colbert')
reranked_results = ranker.rank(query=query, docs=candidate_docs)

# Get top 50 reranked results
top_results = reranked_results.top_k(50)

Reranking with ColBERT...
Loading default colbert model for language en
Default Model: colbert-ir/colbertv2.0
Loading ColBERTRanker model colbert-ir/colbertv2.0 (this message can be suppressed by setting verbose=0)
No device set
Using device cpu
No dtype set
Using dtype torch.float32
Loading model colbert-ir/colbertv2.0, this might take a while...
Linear Dim set to: 128 for downcasting


### Display Results

To better analyze our results, I've formatted them here as a Polars DataFrame.

In [21]:
# Combine reranking results with original metadata
final_results = []
for result in top_results:
    original_metadata = candidate_metadata[result.doc_id]
    final_results.append({
        'id': original_metadata['id'],
        'score': result.score,  # ColBERT reranking score
        'user_id': original_metadata['user_id'],
        'payload': original_metadata['document']
    })

df = pl.DataFrame(final_results)
pl.Config.set_fmt_str_lengths(200) # Show up to 200 characters from abstract
print(df)


shape: (50, 4)
┌──────┬──────────┬─────────┬──────────────────────────────────────────────────────────────────────┐
│ id   ┆ score    ┆ user_id ┆ payload                                                              │
│ ---  ┆ ---      ┆ ---     ┆ ---                                                                  │
│ i64  ┆ f64      ┆ str     ┆ str                                                                  │
╞══════╪══════════╪═════════╪══════════════════════════════════════════════════════════════════════╡
│ 4942 ┆ 0.711476 ┆ user_3  ┆ We study the globular cluster (GC) systems in three representative   │
│      ┆          ┆         ┆ fossil group galaxies: the nearest (NGC6482), the prototype          │
│      ┆          ┆         ┆ (NGC1132) and the most massive known to date (ESO306-017). This is   │
│      ┆          ┆         ┆ the …                                                                │
│ 162  ┆ 0.669475 ┆ user_3  ┆ Dwarf spheroidal galaxies are characterized by

## Conclusion and Takeaways

### What We've Accomplished

**Multiple Embedding Types**  
- Dense embeddings for semantic understanding  
- Sparse embeddings for keyword precision  
- Late interaction embeddings for fine-grained relevance  

**Production-Ready Architecture**  
- Scalable vector database with Qdrant  
- Efficient batch processing and indexing  
- Multi-tenant security with user filtering  

**Advanced Search Capabilities**  
- Hybrid retrieval combining multiple signals  
- Sophisticated ranking and reranking  
- Flexible query processing pipeline  

### Potential Enhancements

- Implement query-time filtering for better performance  
- Add caching for frequently accessed embeddings
- Parallelize long-running processes for faster execution
- Integrate with LLMs for a more conversational experience (RAG)

### Key Takeaways

1. **Hybrid approaches outperform single methods** by combining different strengths  
2. **Late interaction models** provide exceptional precision for text search  
3. **Vector databases** enable sophisticated multi-vector search at scale  
4. **Security considerations** are crucial for multi-tenant applications  

This hybrid search system provides a solid foundation for building production-grade search applications that deliver both high recall and precision across diverse query types.

Thanks for the fun project!
