<h1>Vector Search with Qdrant</h1>

<h2> Vector Search: </h2>
Vector search replaces exact keyword matching with semantic similarity, enabling retrieval based on meaning.
It excels in searching through diverse data types like text, images, audio, video, and code, even when phrasing or formats differ.
It does this by converting data into numbers (vector embeddings). The distance between two vectors in a high-dimensional space reflects how related their meanings are: items that are similar in meaning and context sit closer together. This lets search systems recognize patterns and relationships between concepts and retrieve the most relevant content, even when the phrasing differs, terminology varies, or no explicit keywords match.
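
As a rough illustration of the idea, the sketch below ranks words by cosine similarity to a query word. The three-dimensional vectors are invented for this example (real embedding models produce hundreds of dimensions), so only the relative ordering is meaningful.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings; the values are made up for illustration.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.9, 0.15],
    "car": [0.1, 0.2, 0.95],
}

def rank_by_similarity(query_word):
    """Return the other words, most similar to the query word first."""
    query_vector = toy_embeddings[query_word]
    scores = {
        word: cosine_similarity(query_vector, vector)
        for word, vector in toy_embeddings.items()
        if word != query_word
    }
    return sorted(scores, key=scores.get, reverse=True)

print(rank_by_similarity("cat"))
```

With these toy values, "kitten" ranks ahead of "car" because its vector points in nearly the same direction as the one for "cat".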

<h2> Qdrant: </h2>
Qdrant is a high-performance, open-source vector search engine built in Rust for scalable, production-grade applications.
It offers advanced vector search capabilities beyond basic similarity, staying aligned with modern AI search trends.


<h4>TLDR:</h4>
Vector search captures the context and meaning behind words (and unstructured data) by mapping language to maths (vectors and geometry), unlike syntactic search, which compares strings. Qdrant is a vector search engine.


Running Qdrant in a Docker container (in a GitHub Codespace):
Run the following in a terminal:
<!-- 
docker pull qdrant/qdrant

docker run -p 6333:6333 -p 6334:6334 \
   -v "$(pwd)/qdrant_storage:/qdrant/storage:z" \
   qdrant/qdrant -->

In [1]:
# qdrant-client is the Python client for Qdrant; the fastembed extra adds optimized embedding (data vectorization) designed for Qdrant.

!python -m pip install -q "qdrant-client[fastembed]>=1.14.2"

<h3>Step 1: Import required libraries and connect to Qdrant.</h3>


In [2]:
from qdrant_client import QdrantClient, models

In [3]:
# connect to the local Qdrant instance
client = QdrantClient("http://localhost:6333") 

In [4]:
import requests

docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
docs_response.raise_for_status()    # stop early if the download failed
documents_raw = docs_response.json()

<h3>Step 2: Study the dataset</h3>


In [16]:
#documents_raw

The data appears to be already cleaned and chunked (a list of dictionaries containing question-answer pairs) and contains only English text. Next, we decide which fields to use for vector search and which to store as metadata.
Metadata is useful for filtering conditions.

Since we are building a Q&A RAG system, it makes sense to embed the answers (text) and run vector search with the question as the query.
Fields such as course and section can be stored as metadata and used as filters.
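
The split between embedded text and metadata can be sketched as follows. The sample record is hypothetical and only mirrors the nested layout of the dataset (a list of courses, each with a list of documents); `split_fields` is an illustrative helper name, not part of any library.

```python
# Hypothetical sample that mirrors the nested layout of the dataset:
# a list of courses, each holding a list of question-answer documents.
sample_documents = [
    {
        "course": "example-course",
        "documents": [
            {
                "text": "Example answer text.",
                "section": "Example section",
                "question": "Example question?",
            },
        ],
    },
]

def split_fields(raw_courses):
    """Yield (text_to_embed, metadata) pairs, one per answer."""
    for course in raw_courses:
        for doc in course["documents"]:
            metadata = {"section": doc["section"], "course": course["course"]}
            yield doc["text"], metadata

for text, metadata in split_fields(sample_documents):
    print(text, metadata)
```

The question field is deliberately left out of both parts here, since questions are used only as queries at search time.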

<h3>Step 3: Choose the optimal embedding model with fastembed for our textual data</h3>


In [6]:
from fastembed import TextEmbedding
# TextEmbedding.list_supported_models()

In [7]:
import json

embedding_dimensionality = 512    # each vector will have a dimensionality of 512

for model in TextEmbedding.list_supported_models():
    if model['dim'] == embedding_dimensionality:
        print(json.dumps(model, indent = 2))

{
  "model": "BAAI/bge-small-zh-v1.5",
  "sources": {
    "hf": "Qdrant/bge-small-zh-v1.5",
    "url": "https://storage.googleapis.com/qdrant-fastembed/fast-bge-small-zh-v1.5.tar.gz",
    "_deprecated_tar_struct": true
  },
  "model_file": "model_optimized.onnx",
  "description": "Text embeddings, Unimodal (text), Chinese, 512 input tokens truncation, Prefixes for queries/documents: not so necessary, 2023 year.",
  "license": "mit",
  "size_in_GB": 0.09,
  "additional_files": [],
  "dim": 512,
  "tasks": {}
}
{
  "model": "Qdrant/clip-ViT-B-32-text",
  "sources": {
    "hf": "Qdrant/clip-ViT-B-32-text",
    "url": null,
    "_deprecated_tar_struct": false
  },
  "model_file": "model.onnx",
  "description": "Text embeddings, Multimodal (text&image), English, 77 input tokens truncation, Prefixes for queries/documents: not necessary, 2021 year",
  "license": "mit",
  "size_in_GB": 0.25,
  "additional_files": [],
  "dim": 512,
  "tasks": {}
}
{
  "model": "jinaai/jina-embeddings-v2-small-e

In [8]:
model_handle = "jinaai/jina-embeddings-v2-small-en"

# Like most dense embedding models, this one measures semantic closeness with cosine similarity (the angle between two vectors).

<h3>Step 4: Creating a collection</h3>



A collection is a named set of points that you can search within. A point is a record consisting of an ID, a vector, and an optional payload. 
Embeddings capture the semantic essence of the data, while the payload holds structured metadata.
This metadata becomes especially useful when applying filters or sorting during search.

When creating a collection, we need to specify:

1. Name: A unique identifier for the collection.
2. Vector Configuration:
    1. Size: The dimensionality of the vectors.
    2. Distance Metric: The method used to measure similarity between vectors.
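
The choice of distance metric decides what "close" means. As a small sketch in plain Python (not Qdrant code), the example below scores the same pair of made-up vectors with cosine similarity, dot product, and Euclidean distance, three of the metrics Qdrant offers. The two vectors point in the same direction but differ in length.

```python
import math

def dot(a, b):
    """Dot product: grows with both alignment and vector length."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: depends only on the angle, not on length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    """Euclidean distance: straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up vectors: same direction, very different magnitude.
short = [1.0, 1.0]
long_same_direction = [10.0, 10.0]

print("cosine:   ", cosine(short, long_same_direction))
print("dot:      ", dot(short, long_same_direction))
print("euclidean:", euclidean(short, long_same_direction))
```

Cosine treats the two vectors as essentially identical, while Euclidean distance reports them as far apart. This is why the metric should match the one the embedding model was designed for.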

In [9]:
collection_name = 'zoomcamp-rag'

client.create_collection(
    collection_name = collection_name,
    vectors_config = models.VectorParams(
        size = embedding_dimensionality,    # dimensionality of the vectors
        distance = models.Distance.COSINE   # distance metric for similarity search
    )
)

UnexpectedResponse: Unexpected Response: 409 (Conflict)
Raw response content:
b'{"status":{"error":"Wrong input: Collection `zoomcamp-rag` already exists!"},"time":0.002862137}'
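
The 409 (Conflict) response above means the collection already exists, which typically happens when the cell is re-run. One way to make the cell safe to re-run is sketched below. `ensure_collection` is a hypothetical helper name, and the sketch assumes the installed qdrant-client version provides `collection_exists`.

```python
def ensure_collection(client, name, vectors_config):
    """Create the collection only if it is missing.

    Returns True if a new collection was created, False if it already existed.
    Assumes `client` offers collection_exists() and create_collection().
    """
    if client.collection_exists(name):
        return False
    client.create_collection(collection_name=name, vectors_config=vectors_config)
    return True
```

In this notebook it could be called as `ensure_collection(client, collection_name, models.VectorParams(size=embedding_dimensionality, distance=models.Distance.COSINE))`. Unlike deleting and recreating the collection, this keeps any points that were already uploaded.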

<h3>Step 5: Create, Embed & Insert points into the collection.</h3>


Points are the core data entities in Qdrant. Each point consists of:

1. ID. A unique identifier. 
2. Vector. The embedding that represents the data point in vector space.
3. Payload (optional). Additional metadata as key-value pairs.

In [None]:
# creating points: one point per answer

points = []
point_id = 0    # named point_id to avoid shadowing Python's built-in id()

for course in documents_raw:
    for doc in course['documents']:

        point = models.PointStruct(
            id = point_id,
            vector = models.Document(text = doc['text'], model = model_handle),    # text to embed with the chosen model
            payload = {
                "text": doc['text'],
                "section": doc['section'],
                "course": course['course']
            }
        )
        points.append(point)

        point_id += 1

In [None]:
# embedding and uploading the points

# upsert is the combination of insert and update

client.upsert(
    collection_name = collection_name,
    points = points
)
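
The upsert semantics can be pictured with a toy in-memory store keyed by point ID. This is only a mental model in plain Python, not how Qdrant stores data, and the `upsert` function here is a stand-in defined for the example.

```python
def upsert(store, points):
    """Insert points with new IDs and overwrite points whose ID already exists."""
    for point in points:
        store[point["id"]] = point
    return store

store = {}
upsert(store, [{"id": 0, "text": "first version"}, {"id": 1, "text": "other"}])
upsert(store, [{"id": 0, "text": "second version"}])  # same ID: updated, not duplicated

print(len(store), store[0]["text"])
```

The practical consequence is that re-running the upload cell with the same IDs overwrites the existing points instead of creating duplicates.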

<h3>Study the data visually:</h3>

Explore the uploaded data in the Qdrant Web UI at http://localhost:6333/dashboard to study semantic similarity visually.

For example, using the Visualize tab in the zoomcamp-rag collection, we can view all answers to the course questions (948 points) and see how they group together by meaning, additionally coloured by the course type.

To do that, run the following request in the Visualize tab:

In [None]:
# {
#   "limit": 948,
#   "color_by": {
#     "payload": "course"
#   }
# }

This 2D representation is the result of dimensionality reduction applied to the jina-embeddings vectors.


<h3> Step 6: Running a similarity search</h3>



<h4>How similarity search works:</h4>

1. Qdrant compares the query vector to the stored vectors, using the vector index and the distance metric defined for the collection.

2. The closest matches are returned, ranked by similarity. The vector index is built for ANN (Approximate Nearest Neighbour) search, because exact search has to compare the query against every stored vector, which becomes too slow as the collection grows.
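
For contrast, here is a minimal sketch of exact (brute-force) search over a few made-up two-dimensional vectors. It is not how Qdrant searches: an ANN index avoids scoring every stored vector, trading a little accuracy for speed. The point IDs and vectors are invented for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def exact_search(query_vector, stored_points, limit=1):
    """Score every stored vector against the query and keep the best `limit` IDs."""
    scored = [
        (cosine_similarity(query_vector, vector), point_id)
        for point_id, vector in stored_points.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [point_id for _, point_id in scored[:limit]]

# Hypothetical stored points: ID -> vector.
stored_points = {
    "a": [1.0, 0.0],
    "b": [0.7, 0.7],
    "c": [0.0, 1.0],
}

print(exact_search([0.9, 0.1], stored_points, limit=2))
```

The loop touches every stored vector, so the cost grows linearly with the size of the collection. That cost is what an approximate index is designed to avoid.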
 

In [11]:
def search(query, limit=1):
    
    results = client.query_points(
        
        collection_name = collection_name,
        
        query = models.Document(    # embed the query text locally using "jinaai/jina-embeddings-v2-small-en"
            text = query,
            model = model_handle
        ),
        limit = limit,    # top closest matches
        with_payload = True    # to get metadata in results
    )

    return results

Now, let's pick a random question from the course data.

Remember that we did not embed or upload the questions to Qdrant, only the answers.

In [26]:
import random

course = random.choice(documents_raw)
course_piece = random.choice(course['documents'])
print(json.dumps(course_piece, indent = 2))

{
  "section": "Module 4: Deployment",
  "question": "\u2018Invalid base64\u2019 error after running `aws kinesis put-record`"
}


In [27]:
result = search(course_piece['question'])
result



Now, let's compare the retrieved answer with the original answer for the randomly selected question.

In [31]:
print(f"Question:\n{course_piece['question']}\n")
print("Top Retrieved Answer:\n{}\n".format(result.points[0].payload['text']))
print("Original Answer:\n{}\n".format(course_piece['text']))

Question:
‘Invalid base64’ error after running `aws kinesis put-record`

Top Retrieved Answer:
Solution description: To get around this, pass the argument ‘--cli-binary-format raw-in-base64-out’. This will encode your data string into base64 before passing it to kinesis
Added by M

Original Answer:
Solution description: To get around this, pass the argument ‘--cli-binary-format raw-in-base64-out’. This will encode your data string into base64 before passing it to kinesis
Added by M



Now, let's search with a question that was not in the dataset.

In [32]:
print(search("What if i submit homework late?").points[0].payload['text'])

No, late submissions are not allowed. But if the form is still not closed and it’s after the due date, you can still submit the homework. confirm your submission by the date-timestamp on the Course page.y
Older news:[source1] [source2]


<h3> Step 7: Running a similarity search with filters</h3>

Using Qdrant's custom vector index implementation, filterable HNSW, we can get precise and scalable vector search results with filtering conditions.
A 'must' clause ensures that all specified conditions are met for a data point to be included in the search results.
Qdrant also supports other clauses such as 'should' and 'must_not', as well as condition types such as 'range'.
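
The clause semantics can be approximated in plain Python as shown below. This is a simplified model, not Qdrant's implementation: each condition is reduced to an exact-match (key, value) pair, the clauses are assumed to combine with AND, and the `matches` helper and sample payload are hypothetical.

```python
def matches(payload, must=(), should=(), must_not=()):
    """Simplified filter check where each condition is a (key, value) exact match.

    must:     every condition has to hold
    should:   at least one condition has to hold (ignored when empty)
    must_not: no condition may hold
    """
    def holds(condition):
        key, value = condition
        return payload.get(key) == value

    return (
        all(holds(c) for c in must)
        and (not should or any(holds(c) for c in should))
        and not any(holds(c) for c in must_not)
    )

# Hypothetical payload for illustration.
payload = {"course": "machine-learning-zoomcamp", "section": "Module 4: Deployment"}

print(matches(payload, must=[("course", "machine-learning-zoomcamp")]))
print(matches(payload, must_not=[("course", "machine-learning-zoomcamp")]))
print(matches(payload, should=[("course", "other-course"), ("section", "Module 4: Deployment")]))
```

In Qdrant the filter is applied during the vector search itself, so only points that satisfy it are considered as candidates.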

In [34]:
# index the payload field so that filtering on it is efficient

client.create_payload_index(
    collection_name = collection_name,
    field_name = 'course',
    field_schema = 'keyword'
)

UpdateResult(operation_id=3, status=<UpdateStatus.COMPLETED: 'completed'>)

In [35]:
# updating the search function

def search_in_course(query, course, limit=1):    # parameters without defaults must come before those with defaults
    results = client.query_points(
        collection_name = collection_name,
        query = models.Document(      # embed the query text locally with "jinaai/jina-embeddings-v2-small-en"
            text = query,
            model = model_handle 
        ),
        query_filter = models.Filter(   # filter by course name
            must = [
                 models.FieldCondition(
                     key = 'course',
                     match = models.MatchValue(value = course)
                 )   
            ]
        ),
        limit = limit,   # top closest matches
        with_payload = True    #to get metadata in the results
    )

    return results

In [38]:
print(search_in_course("What if i submit homework late?", course = 'machine-learning-zoomcamp').points[0].payload['text'])

Depends on whether the form will still be open. If you're lucky and it's open, you can submit your homework and it will be evaluated. if closed - it's too late.
(Added by Rileen Sinha, based on answer by Alexey on Slack)


<h3>Conclusion: </h3>

A simple semantic search using Qdrant on FAQ documents has been completed.

The next step is hybrid search, which combines the strengths of keyword-based search and vector search.

In many real-world applications, the two approaches work hand in hand, balancing the precision of keywords with the flexibility of embeddings to deliver the best results.