<h1> Hybrid search with Qdrant </h1>
Vector search based on dense embeddings captures the semantics of the data, so queries and documents don't have to share the same terms for the relevant items to be found. Historically, however, search relied on methods that depend on the presence of keywords. Methods such as Bag-of-Words, TF-IDF, and BM25 are still useful and in some cases should be preferred over dense embeddings.


<h2>Sparse vectors</h2>

Surprisingly, keyword-based search is also implemented as vector search, but these vectors are usually sparse. That means the majority of the dimensions of such a vector are just zeros. A non-zero value at a particular vector dimension indicates the presence of a term from the dictionary assigned to that position. In other words, in sparse vectors, we have a dictionary in which each word/phrase gets its unique position. Since vectors are sparse, the dictionary can theoretically grow indefinitely, as we can append a new term at the very end.

This flexible dictionary makes sparse vectors excel at exact matches, as they can cover texts that would look like sets of random characters to a dense model, such as proper names or identifiers. Dense embedding models also have a dictionary, but once the model is trained, extending it is not that easy and requires fine-tuning the model. A typical user rarely goes that far.
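To make the dictionary idea concrete, here is a minimal sketch in plain Python. The `Vocabulary` class and `to_sparse` helper are hypothetical names invented for this illustration, and plain term counts stand in for real term weights. It is not how Qdrant or FastEmbed store sparse vectors internally.

```python
from collections import Counter


class Vocabulary:
    """Maps each term to a fixed position; new terms are appended at the end."""

    def __init__(self) -> None:
        self.term_to_index: dict[str, int] = {}

    def index_of(self, term: str) -> int:
        # Unknown terms get the next free position, so the dictionary can grow.
        if term not in self.term_to_index:
            self.term_to_index[term] = len(self.term_to_index)
        return self.term_to_index[term]


def to_sparse(text: str, vocab: Vocabulary) -> dict[int, float]:
    """Encode text as {dimension: term count}; all other dimensions are implicit zeros."""
    counts = Counter(text.lower().split())
    return {vocab.index_of(term): float(count) for term, count in counts.items()}


vocab = Vocabulary()
doc_a = to_sparse("qdrant stores sparse vectors", vocab)
# An identifier such as "X-42b" simply becomes one more dictionary entry.
doc_b = to_sparse("sparse vectors and identifier X-42b", vocab)
print(doc_a)
print(doc_b)
```

Only the non-zero dimensions are stored, which is why adding a new term never requires touching the vectors that already exist.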

<h2>BM25</h2>
There are plenty of different options for creating sparse embeddings, but BM25 is an industry standard, and its most popular form comes from the 90s. It's a statistical model (no neural networks involved), which makes it really fast and lightweight. It's usually a solid baseline in search benchmarks so you should not ignore it.
<br>
<br>
BM25 stands for Best Matching 25, and it was just the 25th attempt to create a formula that calculates how relevant a particular document is, given a query. In general, BM25 is a ranking function that helps search engines determine how relevant a document is to a query by combining two key concepts: Term Frequency (TF) and Inverse Document Frequency (IDF).

1. The Term Frequency component rewards documents that contain the query terms multiple times, but with diminishing returns - so a document with 10 occurrences of a word isn't necessarily 10 times better than one with just 1 occurrence.

   
2. The Inverse Document Frequency part boosts the importance of rare words while reducing the weight of common words that appear in many documents, since rare terms are typically more informative for distinguishing relevant results.

BM25 also incorporates document length normalization to prevent longer documents from having an unfair advantage simply due to their size.
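The three ingredients above (saturating term frequency, IDF, and length normalization) can be sketched in a few lines of Python. This is the textbook Okapi BM25 formula with whitespace tokenization and the commonly used defaults `k1 = 1.2` and `b = 0.75`. Those defaults and the `bm25_scores` name are assumptions made for this illustration, and the `Qdrant/bm25` model used later may tokenize and weight terms differently.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every document against the query with the textbook BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Longer-than-average documents get a larger denominator (length normalization).
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher IDF weight than common ones.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # The TF part saturates: each extra occurrence adds less than the previous one.
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "pandas series round",
    "round a number to four decimal places",
    "qdrant stores vectors",
]
scores = bm25_scores("pandas round", docs)
print(scores)
```

The first document matches both query terms and scores highest, the second matches only the more common term, and the third shares no term with the query, so its score is zero.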

<h2>Step 1: Connecting to Qdrant</h2>
Connect to the Qdrant instance and check that the connection works by listing all the collections.

In [6]:
from qdrant_client import QdrantClient

client = QdrantClient("http://localhost:6333")
client.get_collections()

CollectionsResponse(collections=[CollectionDescription(name='zoomcamp-sparse'), CollectionDescription(name='zoomcamp-sparse-and-dense'), CollectionDescription(name='zoomcamp-rag')])

<h2>Step 2: Sparse vector search with BM25</h2>
We are going to use the same dataset as before. Let's download it and load it into Qdrant, but this time we are going to create sparse vectors with BM25 only.

In [7]:
import requests

docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

In [8]:
# documents_raw[0]

# Uncommenting the line above prints a very large document, which makes the notebook long to scroll through on GitHub.

We need to create a collection first. Qdrant will handle the IDF calculations if we configure it to. That's required for BM25; otherwise, it won't boost the rare words.

In [9]:
from qdrant_client import models

# creating the collection with sparse vector parameters
client.create_collection(
    collection_name = "zoomcamp-sparse",
    sparse_vectors_config = {
        "bm25": models.SparseVectorParams(
            modifier = models.Modifier.IDF,
        )
    }
)

UnexpectedResponse: Unexpected Response: 409 (Conflict)
Raw response content:
b'{"status":{"error":"Wrong input: Collection `zoomcamp-sparse` already exists!"},"time":0.00012313}'
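The 409 response above appears because the collection already exists from an earlier run of the notebook. One way to make the cell safe to re-run is to check for the collection first. The sketch below assumes a `qdrant-client` version that provides `collection_exists`. The `ensure_collection` helper is a hypothetical name introduced here, not part of the library.

```python
from typing import Any


def ensure_collection(client: Any, collection_name: str, **create_kwargs: Any) -> bool:
    """Create the collection only if it is missing; return True when it was created."""
    if client.collection_exists(collection_name):
        return False
    client.create_collection(collection_name=collection_name, **create_kwargs)
    return True


# Usage with the objects defined in this notebook (requires a running Qdrant instance):
# ensure_collection(
#     client,
#     "zoomcamp-sparse",
#     sparse_vectors_config={"bm25": models.SparseVectorParams(modifier=models.Modifier.IDF)},
# )
```

Skipping creation keeps the existing points in place. If a fresh collection is needed instead, it has to be deleted explicitly before creating it again.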

FastEmbed comes with a BM25 implementation that we can use like any other model.

In [10]:
import uuid

# sending the points to the collection
client.upsert(
    collection_name = "zoomcamp-sparse",
    points = [
        models.PointStruct(
            id = uuid.uuid4().hex,
            vector = {
                "bm25": models.Document(text = doc['text'], model = "Qdrant/bm25"),
            },
            payload = {
                "text": doc['text'],
                "section": doc['section'],
                "course": course['course'],
            },
        )
        for course in documents_raw
        for doc in course['documents']
    ]
)

UpdateResult(operation_id=3, status=<UpdateStatus.COMPLETED: 'completed'>)

The upload was fast because BM25, unlike dense embedding models, does not require a neural network.

<h2>Step 3: Running sparse vector search with BM25</h2>
Right now, our vectors are ready to be searched over. Let's create a helper function.

In [11]:
def search(query: str, limit: int = 1) -> list[models.ScoredPoint]:
    results = client.query_points(
        collection_name = 'zoomcamp-sparse',
        query = models.Document(
            text = query,
            model = "Qdrant/bm25",
        ),
        using = "bm25",
        limit = limit,
        with_payload = True,
    )

    return results.points

In [12]:
results = search("Qdrant")
results

[]

Sparse vector search can return no results if none of the query keywords appear in any document, even when the documents contain synonyms. Terminology matters.
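The empty result above follows from how sparse vectors are compared: only dimensions that are non-zero in both the query and the document contribute to the score. A minimal sketch, with hypothetical dictionary positions and weights chosen purely for illustration:

```python
def sparse_dot(a: dict[int, float], b: dict[int, float]) -> float:
    """Dot product of two sparse vectors stored as {dimension: value}."""
    # Only dimensions present in both vectors contribute to the score.
    shared = a.keys() & b.keys()
    return sum((a[i] * b[i] for i in shared), 0.0)


# Hypothetical dictionary positions: 0 = "pandas", 1 = "round", 2 = "qdrant"
document = {0: 1.2, 1: 0.7}
pandas_query = {0: 1.0}
qdrant_query = {2: 1.0}

print(sparse_dot(pandas_query, document))  # shares dimension 0 with the document
print(sparse_dot(qdrant_query, document))  # shares no dimension, so the score is zero
```

A document whose score is zero has nothing in common with the query, which is consistent with the search for "Qdrant" returning an empty list.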

In [13]:
results = search("pandas")
print(results[0].payload["text"])

You can use round() function or f-strings
round(number, 4)  - this will round number up to 4 decimal places
print(f'Average mark for the Homework is {avg:.3f}') - using F string
Also there is pandas.Series. round idf you need to round values in the whole Series
Please check the documentation
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.round.html#pandas.Series.round
Added by Olga Rudakova


Scores returned by BM25 are not calculated with cosine similarity but with the BM25 formula, so they are not bounded to a specific range. Let's see what they look like.

In [14]:
results[0].score

6.0538354

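That's an important observation before we start implementing hybrid search, because such scores cannot be compared directly with the cosine similarities returned by dense search.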

<h2>Natural-language-like queries</h2>
Let's try again with a random question from our dataset to see how well sparse vector search works with longer, natural-language-like queries.

In [15]:
import random
import json

random.seed(202506)

course = random.choice(documents_raw)
course_piece = random.choice(course['documents'])
print(json.dumps(course_piece, indent = 2))

{
  "text": "Even though the upload works using aws cli and boto3 in Jupyter notebook.\nSolution set the AWS_PROFILE environment variable (the default profile is called default)",
  "section": "Module 4: Deployment",
  "question": "Uploading to s3 fails with An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.\""
}


In [16]:
results = search(course_piece['question'])
print(results[0].payload["text"])

The trial dbt account provides access to dbt API. Job will still be needed to be added manually. Airflow will run the job using a python operator calling the API. You will need to provide api key, job id, etc. (be careful not committing it to Github).
Detailed explanation here: https://docs.getdbt.com/blog/dbt-airflow-spiritual-alignment
Source code example here: https://github.com/sungchun12/airflow-toolkit/blob/95d40ac76122de337e1b1cdc8eed35ba1c3051ed/dags/examples/dbt_cloud_example.py


<h2>Step 4: Qdrant Universal Query API - prefetching</h2>
Qdrant's .query_points method allows building multi-step search pipelines which can incorporate various methods into a single call. For example, we can retrieve some candidates with dense vector search, and then rerank them with sparse search, or use a fast method for initial retrieval and precise, but slow, reranking.


┌─────────────┐           ┌─────────────┐
│             │           │             │
│  Retrieval  │ ────────► │  Reranking  │
│             │           │             │
└─────────────┘           └─────────────┘
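The two-stage flow in the diagram above can be sketched without Qdrant. The toy scorers below (`loose_match` and `exact_match`) and the `retrieve_then_rerank` helper are hypothetical stand-ins invented for this illustration. In the real pipeline, the stages would be dense and sparse vector search.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    docs: list[str],
    retrieve_score: Scorer,
    rerank_score: Scorer,
    limit: int = 1,
    prefetch_factor: int = 10,
) -> list[str]:
    # Stage 1: keep a wider candidate set than we finally need.
    candidates = sorted(docs, key=lambda d: retrieve_score(query, d), reverse=True)
    candidates = candidates[: limit * prefetch_factor]
    # Stage 2: reorder only those candidates with the second scorer.
    reranked = sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
    return reranked[:limit]


def loose_match(query: str, doc: str) -> float:
    """Toy recall-oriented scorer: tokens match on their first three characters."""
    q = {token[:3] for token in query.lower().split()}
    d = {token[:3] for token in doc.lower().split()}
    return float(len(q & d))


def exact_match(query: str, doc: str) -> float:
    """Toy precision-oriented scorer: tokens must match exactly."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


docs = [
    "uploading files to s3",
    "upload fails on s3 bucket",
    "dbt cloud job setup",
]
top = retrieve_then_rerank("s3 upload fails", docs, loose_match, exact_match, limit=1, prefetch_factor=2)
print(top)
```

The second stage only ever sees what the first stage returned, so a document dropped during retrieval cannot be recovered by reranking. That is why the candidate set is deliberately larger than the final limit.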

Let's create another collection that will keep both dense and sparse representations. Qdrant's named vectors allow us to store multiple representations per point, which is especially useful when we want to use multiple models in our applications.

In [17]:
client.create_collection(
    collection_name = "zoomcamp-sparse-and-dense",
    vectors_config = {
        "jina-small": models.VectorParams( size = 512, distance = models.Distance.COSINE,),
    },
    sparse_vectors_config = {
        "bm25": models.SparseVectorParams( modifier = models.Modifier.IDF,)
    }
)

UnexpectedResponse: Unexpected Response: 409 (Conflict)
Raw response content:
b'{"status":{"error":"Wrong input: Collection `zoomcamp-sparse-and-dense` already exists!"},"time":0.000224579}'

Uploading all the vectors into the newly created collection.

In [18]:
client.upsert(
    collection_name = "zoomcamp-sparse-and-dense",
    points = [
        models.PointStruct(
            id = uuid.uuid4().hex,
            vector = {
                "jina-small": models.Document(
                    text = doc['text'],
                    model = "jinaai/jina-embeddings-v2-small-en",
                ),
                "bm25": models.Document(
                    text = doc['text'],
                    model = "Qdrant/bm25",
                ),  
            },
            payload = {
                "text": doc["text"],
                "section": doc["section"],
                "course": course["course"],
            }
        )
        for course in documents_raw
        for doc in course["documents"]
    ]
)

UpdateResult(operation_id=1, status=<UpdateStatus.COMPLETED: 'completed'>)

In [32]:
def multi_stage_search(query: str, limit: int = 1) -> list[models.ScoredPoint]:
    results = client.query_points(
        collection_name = "zoomcamp-sparse-and-dense",
        # retrieving contextually similar matches using dense embeddings
        prefetch = [
            models.Prefetch(
                query = models.Document(
                    text = query,
                    model = "jinaai/jina-embeddings-v2-small-en",
                ),
                using = "jina-small",
                # Prefetch ten times more results than we expect to return, so there is something to rerank
                limit = (10 * limit),
            ),
        ],
        # searching through the retrieved results using sparse keyword search
        query = models.Document(
            text = query,
            model = "Qdrant/bm25",
        ),
        using = "bm25",
        limit = limit,
        with_payload = True,
    )

    return results.points

In [33]:
print(json.dumps(course_piece, indent = 2))

{
  "text": "Even though the upload works using aws cli and boto3 in Jupyter notebook.\nSolution set the AWS_PROFILE environment variable (the default profile is called default)",
  "section": "Module 4: Deployment",
  "question": "Uploading to s3 fails with An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.\""
}


In [34]:
results = multi_stage_search(course_piece["question"])
print(results[0].payload["text"])

Problem description. How can we connect s3 bucket to MLFLOW?
Solution: Use boto3 and AWS CLI to store access keys. The access keys are what will be used by boto3 (AWS' Python API tool) to connect with the AWS servers. If there are no Access Keys how can they make sure that they have the right to access this Bucket? Maybe you're a malicious actor (Hacker for ex). The keys must be present for boto3 to talk to the AWS servers and they will provide access to the Bucket if you possess the right permissions. You can always set the Bucket as public so anyone can access it, now you don't need access keys because AWS won't care.
Read more here: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html
Added by Akshit Miglani
