### Hybrid Search with Qdrant

- Vector search: A paradigm shift from traditional keyword-based search, enabling semantic understanding of data.
- Embeddings: Numerical representations of data (text, images, etc.) in high-dimensional space (e.g., 384 or 1536 dimensions).
    - They are generated by ML models (e.g., OpenAI's text-embedding-ada-002, BERT, or Sentence Transformers like "all-MiniLM-L6-v2").
    - Similar items cluster nearby in vector space.
- Vector Indexing: Speeds up similarity search in high-dimensional spaces. There are two broad approaches:
    - Flat Indexing (Brute-force): Exact but slow (O(n) complexity).
    - Approximate Nearest Neighbors (ANN): Trades some accuracy for speed.
        - HNSW (Hierarchical Navigable Small World): Graph-based, fast and accurate.
        - IVF (Inverted File Index): Clustering-based, good for large datasets.
        - PQ (Product Quantization): Compresses vectors for memory efficiency.
- Similarity Metrics
    - Cosine Similarity: Measures angle between vectors (ignores magnitude). Best for text.
    - Euclidean Distance: Straight-line distance in vector space. Good for images.
    - Dot Product: Combines magnitude and direction. Used when vector norms matter.
- Retrieval Process: Query → Embedding → Nearest Neighbor Search → Ranked Results.
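
To make the metrics and the flat (brute-force) index above concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made up for illustration (real embeddings have hundreds of dimensions), and the helper names are hypothetical rather than part of any library.

```python
import math


def dot(a, b):
    """Dot product: sensitive to both direction and magnitude."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; vector magnitude is ignored."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    """Straight-line distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flat_search(query, vectors, k=2):
    """Exact nearest neighbours: score every stored vector, hence O(n)."""
    scored = [(cosine_similarity(query, v), idx) for idx, v in enumerate(vectors)]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy "embeddings", invented for this example
vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
query = [1.0, 0.0, 0.0]

print(flat_search(query, vectors, k=2))

# Scaling a vector changes the dot product and the Euclidean distance,
# but leaves the cosine similarity untouched.
scaled = [2.0, 0.0, 0.0]
print(cosine_similarity(query, scaled), dot(query, scaled), euclidean_distance(query, scaled))
```

ANN indexes such as HNSW avoid the full scan in `flat_search` at the cost of occasionally missing a true nearest neighbour.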


#### Hybrid search
- Combines vector search with traditional keyword search (BM25, TF-IDF) for better results. In this case, two things happen:
    - Vector search handles semantic meaning ("synonyms, paraphrases").
    - Keyword search handles exact matches ("product codes, names").
- Implementation of hybrid search
    - Reciprocal Rank Fusion (RRF): Merges rankings from both methods.
    - Weighted Scores: Assign weights (e.g., 0.7 vector + 0.3 keyword).
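
The weighted-score option can be sketched as follows. Because vector and keyword scores live on different scales, this example first rescales each score set to [0, 1] with min-max normalization; that normalization step, the function names, and the sample scores are all assumptions made for illustration, not something prescribed by Qdrant.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the [0, 1] range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal, so treat every document as a top hit
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fusion(vector_scores, keyword_scores, vector_weight=0.7, keyword_weight=0.3):
    """Combine two score sets; a document missing from one set contributes 0 there."""
    v = min_max_normalize(vector_scores)
    k = min_max_normalize(keyword_scores)
    fused = {
        doc_id: vector_weight * v.get(doc_id, 0.0) + keyword_weight * k.get(doc_id, 0.0)
        for doc_id in set(v) | set(k)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Made-up scores: cosine similarities on the left, BM25-style scores on the right
vector_scores = {"a": 0.82, "b": 0.78, "c": 0.40}
keyword_scores = {"b": 6.0, "c": 2.0, "d": 4.0}

ranking = weighted_fusion(vector_scores, keyword_scores)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Document `b` ends up first because it scores well in both lists, even though `a` has the best vector score.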

#### Our Implementation
In our context, we use vector search and RAG to improve responses.
- Vector Search: Retrieves relevant context for LLMs.
- The workflow:
    - User query → embedded → vector search over knowledge base → Top-k results fed to LLM as context → LLM generates answer grounded in retrieved docs.
- There is always room to improve accuracy, for example by reranking with a slower but more accurate model such as a cross-encoder, or by choosing between static and dynamic embeddings.


#### End-to-End Vector Search Pipeline
**Data Preparation**
- Chunk documents → Generate embeddings → Store in vector DB (Qdrant)

**Query Processing**
- Embed query → Search index → Hybrid ranking (optional) (Qdrant or Elasticsearch)

**Result Delivery**
- Return nearest neighbors → Optionally rerank → Pass to LLM (RAG) (Gemini, OpenAI, etc)
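
The pipeline starts with chunking, which can be as simple as splitting text into overlapping word windows. The sketch below is one possible approach under that assumption; the function name and the `chunk_size` and `overlap` parameters are hypothetical, and production systems often chunk by sentences, headings, or tokens instead.

```python
def chunk_words(text, chunk_size=5, overlap=2):
    """Split text into windows of chunk_size words that share overlap words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


sample = "one two three four five six seven eight"
for chunk in chunk_words(sample):
    print(chunk)
```

Each chunk would then be embedded and stored as its own point in the vector database.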

In [3]:
# Semantic Search finds results matching meaning (not just keywords).
# Vector Search  searches using numerical vectors (embeddings) instead of keywords.
# Hybrid Search combines vector search (semantic) + keyword search (exact matches).

- Sparse vectors: high-dimensional vectors where most values are zero; keyword-based (e.g., TF-IDF, BM25).
- Dense vectors: vectors where most values are non-zero; generated by ML models (embeddings).

### When to Use Which?
- Use Vector Search when you need semantic similarity (e.g., RAG, recommendations) or the data is unstructured (text, images).
- Use Hybrid Search when hybrid logic (keywords + vectors + business rules) is needed or when explainability is required (e.g., search engines showing why a result matched).

### Sparse vectors
Surprisingly, keyword-based search is also implemented as vector search, but these vectors are usually sparse. That means the majority of the dimensions of such a vector are just zeros. A non-zero value at a particular vector dimension indicates the presence of a term from the dictionary assigned to that position. In other words, in sparse vectors, we have a dictionary in which each word/phrase gets its unique position. Since vectors are sparse, the dictionary can theoretically grow indefinitely, as we can append a new term at the very end.

This flexible dictionary makes sparse vectors excel at exact matches, as they can cover text that would look like a random sequence of characters to a dense model, such as proper names or identifiers. Dense embedding models also have a dictionary, but once the model is trained, extending it is not easy and requires fine-tuning the model. A typical user rarely goes that far.
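
The growing dictionary can be illustrated with a few lines of Python. This is a toy sketch with raw term counts as values; the `to_sparse` helper and the whitespace tokenizer are assumptions made for the example, and real sparse models use more careful tokenization and weighting.

```python
from collections import Counter

# Maps each term to its fixed position (dimension) in the sparse vector
vocabulary: dict[str, int] = {}


def to_sparse(text: str) -> tuple[list[int], list[float]]:
    """Return (indices, values) for the non-zero dimensions only."""
    counts = Counter(text.lower().split())
    indices, values = [], []
    for term, count in counts.items():
        # An unseen term is simply appended at the end of the dictionary
        index = vocabulary.setdefault(term, len(vocabulary))
        indices.append(index)
        values.append(float(count))
    return indices, values


print(to_sparse("qdrant stores sparse vectors"))
# An identifier such as "XJ-9000" just becomes one more dictionary entry
print(to_sparse("sparse vectors match XJ-9000 exactly"))
print(vocabulary)
```

Only the non-zero positions are stored, which is why the dictionary can keep growing without making individual vectors larger.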

### BM25
There are plenty of different options for creating sparse embeddings, but BM25 is an industry standard, and its most popular form comes from the 90s. It's a statistical model (no neural networks involved), which makes it really fast and lightweight. It's usually a solid baseline in search benchmarks so you should not ignore it.

BM25 stands for Best Matching 25, and it was just the 25th attempt to create a formula that calculates how relevant a particular document is, given a query. In general, BM25 is a ranking function that helps search engines determine how relevant a document is to a query by combining two key concepts: Term Frequency (TF) and Inverse Document Frequency (IDF). BM25 also incorporates document length normalization to prevent longer documents from having an unfair advantage simply due to their size.

- The Term Frequency component rewards documents that contain the query terms multiple times, but with diminishing returns - so a document with 10 occurrences of a word isn't necessarily 10 times better than one with just 1 occurrence.
- The Inverse Document Frequency part boosts the importance of rare words while reducing the weight of common words that appear in many documents, since rare terms are typically more informative for distinguishing relevant results.
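
To see how these pieces fit together, here is a compact sketch of the commonly cited Okapi BM25 formula. The parameter values `k1=1.2` and `b=0.75`, the "+1" IDF variant, and the whitespace tokenizer are assumptions chosen for illustration; this is not the FastEmbed or Qdrant implementation, which may differ in details.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents in which each term appears at least once
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Longer-than-average documents get a larger denominator
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms receive a higher IDF weight
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates as tf grows (diminishing returns)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "pandas round values in a series",
    "use round or f-strings to format numbers",
    "docker start the qdrant container",
]
scores = bm25_scores("pandas round", docs)
for doc, score in zip(docs, scores):
    print(round(score, 3), doc)
```

The first document matches both query terms, including the rarer `pandas`, so it scores highest; the third matches nothing and scores zero.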

In our case with Qdrant, we'll use an implementation available in FastEmbed. Let's start with the basics.

### Step 1: Connect to Qdrant

We have already set up Qdrant, so we only need to start the Docker container:
- ```docker ps -a```
- ```docker start ID```
- open the forwarded port and append ```/dashboard``` to the URL to reach the UI

In [5]:
# The Qdrant server is assumed to be already running
from qdrant_client import QdrantClient

client = QdrantClient("http://localhost:6333")
client.get_collections()

CollectionsResponse(collections=[CollectionDescription(name='zoomcamp-rag'), CollectionDescription(name='zoomcamp-faq')])

### Step 2: Sparse vector search with BM25

In [6]:
import requests

docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

In [8]:
#documents_raw[0]

In [9]:
from qdrant_client import models

In [10]:
# We need to create a collection first. Qdrant will handle the IDF calculations, if we configure it to. That's required for BM25, otherwise it won't boost the rare words.

# Create the collection with specified sparse vector parameters
client.create_collection(
    collection_name="zoomcamp-sparse",
    sparse_vectors_config={
        "bm25": models.SparseVectorParams(
            modifier=models.Modifier.IDF,
        )
    }
)

True

In [11]:
# FastEmbed comes with a BM25 implementation that we can use as any other model.

import uuid

# Send the points to the collection
client.upsert(
    collection_name="zoomcamp-sparse",
    points=[
        models.PointStruct(
            id=uuid.uuid4().hex,
            vector={
                "bm25": models.Document(
                    text=doc["text"], 
                    model="Qdrant/bm25",
                ),
            },
            payload={
                "text": doc["text"],
                "section": doc["section"],
                "course": course["course"],
            }
        )
        for course in documents_raw
        for doc in course["documents"]
    ]
)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

The upload was fast because BM25 is a purely statistical model: no neural network has to run, unlike with dense embedding models.

### Step 3: Running sparse vector search with BM25
Right now, our vectors are ready to be searched over. Let's create a helper function.

In [12]:
def search(query: str, limit: int = 1) -> list[models.ScoredPoint]:
    results = client.query_points(
        collection_name="zoomcamp-sparse",
        query=models.Document(
            text=query,
            model="Qdrant/bm25",
        ),
        using="bm25",
        limit=limit,
        with_payload=True,
    )

    return results.points

In [13]:
results = search("Qdrant")
results

[]

Sparse vector search returns no results if none of the query keywords appear in the documents, even when the documents contain synonyms. Terminology matters.

In [14]:
results = search("pandas")
print(results[0].payload["text"])

You can use round() function or f-strings
round(number, 4)  - this will round number up to 4 decimal places
print(f'Average mark for the Homework is {avg:.3f}') - using F string
Also there is pandas.Series. round idf you need to round values in the whole Series
Please check the documentation
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.round.html#pandas.Series.round
Added by Olga Rudakova


In [15]:
# Scores returned by BM25 are computed with the BM25 formula, not cosine similarity, so they are not bounded to a fixed range.
# Let's see what they look like. That's an important observation before we start implementing hybrid search.

results[0].score

6.0392046

#### Natural-language-like queries
Let's try again with a random question from our dataset to see how well sparse vector search works with longer, natural-language-like queries.

In [17]:
import random
import json

random.seed(202508)

course = random.choice(documents_raw)
course_piece = random.choice(course["documents"])
print(json.dumps(course_piece, indent=2))

{
  "text": "How to get classification metrics - precision, recall, f1 score, accuracy simultaneously\nUse classification_report from sklearn. For more info check here.\nAbhishek N",
  "section": "4. Evaluation Metrics for Classification",
  "question": "How to get all classification metrics?"
}


In [18]:
results = search(course_piece["question"])
print(results[0].payload["text"])

How to get classification metrics - precision, recall, f1 score, accuracy simultaneously
Use classification_report from sklearn. For more info check here.
Abhishek N


### Step 4: Qdrant Universal Query API - prefetching
Qdrant's `.query_points` method allows building multi-step search pipelines which can incorporate various methods into a single call. For example, we can retrieve some candidates with dense vector search, and then rerank them with sparse search, or use a fast method for initial retrieval and precise, but slow, reranking.

Let's create another collection that will keep both dense and sparse representations. Qdrant named vectors allow us to store multiple representations per point, which is especially useful when we want to use multiple models in our applications.

In [19]:
# Create the collection with both vector types
client.create_collection(
    collection_name="zoomcamp-sparse-and-dense",
    vectors_config={
        # Named dense vector for jinaai/jina-embeddings-v2-small-en
        "jina-small": models.VectorParams(
            size=512,
            distance=models.Distance.COSINE,
        ),
    },
    sparse_vectors_config={
        "bm25": models.SparseVectorParams(
            modifier=models.Modifier.IDF,
        )
    }
)


True

In [20]:
# We have to upload all the vectors into the newly created collection.
client.upsert(
    collection_name="zoomcamp-sparse-and-dense",
    points=[
        models.PointStruct(
            id=uuid.uuid4().hex,
            vector={
                "jina-small": models.Document(
                    text=doc["text"],
                    model="jinaai/jina-embeddings-v2-small-en",
                ),
                "bm25": models.Document(
                    text=doc["text"], 
                    model="Qdrant/bm25",
                ),
            },
            payload={
                "text": doc["text"],
                "section": doc["section"],
                "course": course["course"],
            }
        )
        for course in documents_raw
        for doc in course["documents"]
    ]
)


UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

In [21]:
def multi_stage_search(query: str, limit: int = 1) -> list[models.ScoredPoint]:
    results = client.query_points(
        collection_name="zoomcamp-sparse-and-dense",
        prefetch=[
            models.Prefetch(
                query=models.Document(
                    text=query,
                    model="jinaai/jina-embeddings-v2-small-en",
                ),
                using="jina-small",
                # Prefetch ten times more results than we expect
                # to return, so there is something to rerank
                limit=(10 * limit),
            ),
        ],
        query=models.Document(
            text=query,
            model="Qdrant/bm25", 
        ),
        using="bm25",
        limit=limit,
        with_payload=True,
    )

    return results.points

In [22]:
print(json.dumps(course_piece, indent=2))

{
  "text": "How to get classification metrics - precision, recall, f1 score, accuracy simultaneously\nUse classification_report from sklearn. For more info check here.\nAbhishek N",
  "section": "4. Evaluation Metrics for Classification",
  "question": "How to get all classification metrics?"
}


In [23]:
results = multi_stage_search(course_piece["question"])
print(results[0].payload["text"])

How to get classification metrics - precision, recall, f1 score, accuracy simultaneously
Use classification_report from sklearn. For more info check here.
Abhishek N


### Step 5: Building Hybrid Search
In real production systems, you don't need to choose just one vector type. You never know what kind of queries your users will send to the system. E-commerce search might be just fine with lexical search on top of sparse vectors, as people tend to send keywords, but in conversational systems, such as chatbots, natural language questions might be more frequent. Using one model as a retriever and another one as a reranker is not the only way to use dense and sparse vectors in a single system.

Hybrid Search is a technique for combining results coming from different search methods - for example dense and sparse. There isn't a clear definition of how exactly to implement it, as the main problem is how to mix results coming from methods which are incompatible. Dense and sparse search scores can't be compared directly, so we need another method that will order the final results somehow.

There are two terms important for Hybrid Search: fusion and reranking.

#### Fusion
Fusion is a set of methods that work on the scores or rankings returned by the individual methods. There are various ways to achieve that, but Reciprocal Rank Fusion (RRF) is the most popular technique. It uses only the rank of each document in each method's result list to calculate the final score. You won't need to calculate these scores yourself, as Qdrant has built-in fusion support that we will use.
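
For intuition, Reciprocal Rank Fusion can be sketched in a few lines: each document receives `1 / (k + rank)` from every ranking it appears in, and the contributions are summed. The constant `k=60` is a commonly used default and an assumption here; this toy version is not Qdrant's internal implementation, whose constants and tie handling may differ.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranked list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Only the position matters, never the original score
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Made-up result lists from a dense and a sparse retriever
dense_ranking = ["a", "b", "c"]
sparse_ranking = ["b", "d", "a"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

Because only ranks are used, the incompatible score scales of dense and sparse search never need to be reconciled.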


In [24]:
def rrf_search(query: str, limit: int = 1) -> list[models.ScoredPoint]:
    results = client.query_points(
        collection_name="zoomcamp-sparse-and-dense",
        prefetch=[
            models.Prefetch(
                query=models.Document(
                    text=query,
                    model="jinaai/jina-embeddings-v2-small-en",
                ),
                using="jina-small",
                limit=(5 * limit),
            ),
            models.Prefetch(
                query=models.Document(
                    text=query,
                    model="Qdrant/bm25",
                ),
                using="bm25",
                limit=(5 * limit),
            ),
        ],
        # Fusion query enables fusion on the prefetched results
        query=models.FusionQuery(fusion=models.Fusion.RRF),
        # Return only the requested number of fused results
        limit=limit,
        with_payload=True,
    )

    return results.points

In [25]:
results = rrf_search(course_piece["question"])
print(json.dumps(course_piece, indent=2))
print(results[0].payload["text"])

{
  "text": "How to get classification metrics - precision, recall, f1 score, accuracy simultaneously\nUse classification_report from sklearn. For more info check here.\nAbhishek N",
  "section": "4. Evaluation Metrics for Classification",
  "question": "How to get all classification metrics?"
}
How to get classification metrics - precision, recall, f1 score, accuracy simultaneously
Use classification_report from sklearn. For more info check here.
Abhishek N


#### Reranking
Reranking is a broader term related to Hybrid Search. Fusion is one way to rerank the results of multiple methods, but you can also apply a slower, more accurate method that would not be efficient enough to run over all the documents. There is more to it, though: business rules are often important for retrieval, for instance when you prefer to show documents from the most recent news.
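
As an illustration of a business rule, the sketch below reranks already-retrieved candidates by blending their retrieval score with a recency bonus. Everything here is hypothetical: the candidate structure, the exponential decay with a 30-day half-life, the 0.3 weight, and the assumption that retrieval scores are already in the [0, 1] range.

```python
from datetime import date


def rerank_by_recency(candidates, today, half_life_days=30.0, recency_weight=0.3):
    """Reorder candidates so that newer documents get a boost."""

    def final_score(candidate):
        age_days = (today - candidate["published"]).days
        # Recency halves every half_life_days: 1.0 for today, 0.5 after one half-life
        recency = 0.5 ** (age_days / half_life_days)
        return (1 - recency_weight) * candidate["score"] + recency_weight * recency

    return sorted(candidates, key=final_score, reverse=True)


# Made-up candidates as they might come back from an initial retriever
candidates = [
    {"id": "old", "score": 0.90, "published": date(2024, 1, 1)},
    {"id": "new", "score": 0.85, "published": date(2025, 1, 1)},
]

reranked = rerank_by_recency(candidates, today=date(2025, 1, 2))
print([c["id"] for c in reranked])
```

The slightly less relevant but much newer document moves to the top, which is the kind of trade-off a business rule encodes.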

### Next steps
Dense and sparse vector search methods might not be enough in some cases, but both are fast enough to be used as initial retrievers. Plenty of more accurate yet slower methods exist, such as cross-encoders or multivector representations. These topics are definitely more advanced, and we won't cover them right now.

Everything covered here, hybrid search in particular, is useful for e-learning systems where students ask questions against a knowledge base of their class materials and the retrieved passages are sent to an LLM to compose an answer. This makes the search more precise and relevant, and gives the LLM more focused context.