# Vector Search with Qdrant

### Vector Search

Vector search underpins much of the modern internet, whether you notice it or not. It powers recommendation engines, chatbots, AI agents, and even major search engines.

In simple terms, traditional keyword search works by matching exact words. This works well when you know the precise keywords present in the data. But what happens when there are no keywords? What if you're searching through images, audio, video, or code, or even across modalities?

Moreover, even in text-heavy documents, keyword search struggles to capture context and meaning. The same idea can be phrased in countless ways, so keyword-based methods cannot realistically match every variant.
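To make this limitation concrete, here is a minimal sketch of naive keyword matching. The two snippets and the `tokenize` and `keyword_search` helpers are invented for illustration and are not part of the dataset used later. A query that paraphrases the first snippet shares no words with it, so nothing comes back.

```python
def tokenize(text):
    """Lowercase a string, drop basic punctuation, and split it into a set of words."""
    return set(text.lower().replace("?", "").replace(".", "").split())


def keyword_search(query, documents):
    """Return the documents that share at least one word with the query."""
    query_words = tokenize(query)
    return [doc for doc in documents if query_words & tokenize(doc)]


# Hypothetical snippets, invented for illustration only.
documents = [
    "Late submissions are not allowed.",
    "Install Docker before the first module.",
]

# The exact wording matches; the paraphrase does not.
exact_hits = keyword_search("late submissions", documents)
paraphrase_hits = keyword_search("handed in homework after deadline", documents)

print(exact_hits)
print(paraphrase_hits)
```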

Instead of relying on exact matches, vector search retrieves information based on semantic similarity, measured numerically between vector representations of the data (embeddings). It recognizes patterns and relationships between concepts, enabling search systems to retrieve the most relevant content, even when the phrasing differs, terminology varies, or no explicit keywords exist.
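One common way to measure similarity between embeddings is cosine similarity, the cosine of the angle between two vectors. The sketch below uses made-up three-dimensional vectors as stand-ins for embeddings; real models produce hundreds of dimensions, and the variable names are purely illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings": the first two point in a similar direction.
late_homework = [0.9, 0.1, 0.0]
missed_deadline = [0.8, 0.2, 0.1]
docker_install = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(late_homework, missed_deadline)
sim_unrelated = cosine_similarity(late_homework, docker_install)

print(round(sim_related, 3), round(sim_unrelated, 3))
```

Vectors pointing in nearly the same direction score close to 1, while nearly orthogonal vectors score close to 0.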

### Qdrant

Qdrant is an open-source vector search engine, a dedicated solution built in Rust for scalable vector search. If you're wondering why you might need a dedicated solution for vector search, we’ve addressed that in the article "Built for Vector Search".

TL;DR:

* To run production-grade vector search at scale.

* To stay in sync with the latest trends and best practices.

* To fully use vector search capabilities (including those beyond simple similarity search).

#### 0. Setup
Qdrant is fully open-source, which means you can run it in multiple ways depending on your needs.
You can self-host it on your own infrastructure, deploy it on Kubernetes, or run it in the managed Qdrant Cloud.

We're going to run a Qdrant instance in a Docker container.

##### Docker
All you need to do is pull the image and start the container using the following commands:

docker pull qdrant/qdrant

docker run -p 6333:6333 -p 6334:6334 \
   -v "$(pwd)/qdrant_storage:/qdrant/storage:z" \
   qdrant/qdrant

The second line of the docker run command mounts local storage to keep your data persistent. So even if you restart or delete the container, your data will still be stored locally.

The two published ports are:

* 6333 – REST API port
* 6334 – gRPC API port

To help you explore your data visually, Qdrant provides a built-in Web UI, available in both Qdrant Cloud and local instances. You can use it to inspect collections, check system health, and even run simple queries.

When you're running Qdrant in Docker, the Web UI is available at http://localhost:6333/dashboard
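Before moving on, it can help to confirm that the container is actually reachable. The sketch below uses only the standard library; `qdrant_urls` and `is_reachable` are hypothetical helper names, and the check against the REST `/collections` endpoint is left commented out so the cell does not fail when no instance is running.

```python
import urllib.error
import urllib.request


def qdrant_urls(host="localhost", rest_port=6333):
    """Build the REST base URL and the Web UI URL for a Qdrant instance."""
    base = f"http://{host}:{rest_port}"
    return {"rest": base, "dashboard": f"{base}/dashboard"}


def is_reachable(url, timeout=2.0):
    """Return True if an HTTP GET to `url` succeeds, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        return False


urls = qdrant_urls()
print(urls["dashboard"])

# Uncomment once the container is running:
# print(is_reachable(urls["rest"] + "/collections"))
```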

#### 1. Import Required Libraries & Connect to Qdrant

In [1]:
from qdrant_client import QdrantClient, models

import json



#### Initialize the client

In [8]:
client = QdrantClient(url="http://localhost:6333") # Connect to local Qdrant instance

#### 2. Load the Documents

In [3]:
with open("../documents.json", "r") as file:
    data = json.load(file)
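Judging by how later cells index it, the file appears to hold a list of courses, each with a `course` name and a `documents` list whose items carry `text`, `section`, and `question` fields. The sketch below checks that shape on a tiny invented sample; `sample_json` and its values are placeholders, not the real data.

```python
import json
from collections import Counter

# A tiny stand-in for documents.json; all values are invented.
sample_json = """
[
  {"course": "course-a",
   "documents": [
     {"text": "Answer 1", "section": "General", "question": "Question 1"},
     {"text": "Answer 2", "section": "Module 1", "question": "Question 2"}
   ]},
  {"course": "course-b",
   "documents": [
     {"text": "Answer 3", "section": "General", "question": "Question 3"}
   ]}
]
"""

sample_data = json.loads(sample_json)

# Count documents per course and verify that each one has the expected keys.
docs_per_course = Counter()
required_keys = {"text", "section", "question"}
for course in sample_data:
    for doc in course["documents"]:
        missing = required_keys - doc.keys()
        if missing:
            raise ValueError(f"document is missing keys: {sorted(missing)}")
        docs_per_course[course["course"]] += 1

print(dict(docs_per_course))
```

Running the same loop over `data` gives a quick overview of how many documents each course contributes.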

#### 3. Embedding Model

In [4]:
from fastembed import TextEmbedding 

TextEmbedding.list_supported_models()

[{'model': 'BAAI/bge-base-en',
  'sources': {'hf': 'Qdrant/fast-bge-base-en',
   'url': 'https://storage.googleapis.com/qdrant-fastembed/fast-bge-base-en.tar.gz',
   '_deprecated_tar_struct': True},
  'model_file': 'model_optimized.onnx',
  'description': 'Text embeddings, Unimodal (text), English, 512 input tokens truncation, Prefixes for queries/documents: necessary, 2023 year.',
  'license': 'mit',
  'size_in_GB': 0.42,
  'additional_files': [],
  'dim': 768,
  'tasks': {}},
 {'model': 'BAAI/bge-base-en-v1.5',
  'sources': {'hf': 'qdrant/bge-base-en-v1.5-onnx-q',
   'url': 'https://storage.googleapis.com/qdrant-fastembed/fast-bge-base-en-v1.5.tar.gz',
   '_deprecated_tar_struct': True},
  'model_file': 'model_optimized.onnx',
  'description': 'Text embeddings, Unimodal (text), English, 512 input tokens truncation, Prefixes for queries/documents: not so necessary, 2023 year.',
  'license': 'mit',
  'size_in_GB': 0.21,
  'additional_files': [],
  'dim': 768,
  'tasks': {}},
 {'model':

In [5]:
import json

EMBEDDING_DIMENSIONS = 512 # Dimension of the embedding vectors

for model in TextEmbedding.list_supported_models():
    if model['dim'] == EMBEDDING_DIMENSIONS:
        print(json.dumps(model, indent=2))

{
  "model": "BAAI/bge-small-zh-v1.5",
  "sources": {
    "hf": "Qdrant/bge-small-zh-v1.5",
    "url": "https://storage.googleapis.com/qdrant-fastembed/fast-bge-small-zh-v1.5.tar.gz",
    "_deprecated_tar_struct": true
  },
  "model_file": "model_optimized.onnx",
  "description": "Text embeddings, Unimodal (text), Chinese, 512 input tokens truncation, Prefixes for queries/documents: not so necessary, 2023 year.",
  "license": "mit",
  "size_in_GB": 0.09,
  "additional_files": [],
  "dim": 512,
  "tasks": {}
}
{
  "model": "Qdrant/clip-ViT-B-32-text",
  "sources": {
    "hf": "Qdrant/clip-ViT-B-32-text",
    "url": null,
    "_deprecated_tar_struct": false
  },
  "model_file": "model.onnx",
  "description": "Text embeddings, Multimodal (text&image), English, 77 input tokens truncation, Prefixes for queries/documents: not necessary, 2021 year",
  "license": "mit",
  "size_in_GB": 0.25,
  "additional_files": [],
  "dim": 512,
  "tasks": {}
}
{
  "model": "jinaai/jina-embeddings-v2-small-e

In [6]:
EMBEDDING_MODEL = "jinaai/jina-embeddings-v2-small-en"
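Dimension is only one selection criterion; the language noted in each description and the download size matter too. The sketch below shows one way to narrow the list, using three entries copied from the outputs above and reduced to the fields needed (descriptions shortened). The `candidates` helper is hypothetical, and in the notebook you would pass `TextEmbedding.list_supported_models()` instead of the hand-written list.

```python
# Entries copied from the outputs above, reduced to the fields used here.
models_info = [
    {"model": "BAAI/bge-base-en", "dim": 768, "size_in_GB": 0.42,
     "description": "Text embeddings, Unimodal (text), English"},
    {"model": "BAAI/bge-small-zh-v1.5", "dim": 512, "size_in_GB": 0.09,
     "description": "Text embeddings, Unimodal (text), Chinese"},
    {"model": "Qdrant/clip-ViT-B-32-text", "dim": 512, "size_in_GB": 0.25,
     "description": "Text embeddings, Multimodal (text&image), English"},
]


def candidates(models_info, dim, language):
    """Models with the given dimension whose description mentions the language, smallest first."""
    matching = [
        m for m in models_info
        if m["dim"] == dim and language in m["description"]
    ]
    return sorted(matching, key=lambda m: m["size_in_GB"])


english_512 = candidates(models_info, dim=512, language="English")
print([m["model"] for m in english_512])
```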

#### 4. Create a Collection

In [24]:
# Define the collection name
collection_name = "zoomcamp-rag-1"

In [25]:
# Create the collection with specified vector parameters
client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=EMBEDDING_DIMENSIONS,  # Dimensionality of the vectors
        distance=models.Distance.COSINE  # Distance metric for similarity search
    )
)

True
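As a rough sanity check on the chosen vector size, the raw storage for dense vectors is approximately the number of points times the dimensions times the bytes per component. The sketch below assumes 4-byte float32 components and a hypothetical 1,000 points. It ignores index and payload overhead, so treat the result as a lower bound rather than a sizing guide.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as 32-bit floats


def raw_vector_bytes(num_points, dimensions):
    """Bytes needed for the vector components alone, without any index overhead."""
    return num_points * dimensions * BYTES_PER_FLOAT32


# Hypothetical collection size; the dimension matches EMBEDDING_DIMENSIONS above.
estimate = raw_vector_bytes(num_points=1_000, dimensions=512)
print(f"{estimate / 1024 ** 2:.2f} MiB")
```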

#### 5. Create, Embed & Insert Points into the collection

In [26]:
points = []
point_id = 0  # sequential integer IDs; avoids shadowing the built-in id()

for course in data:
    for doc in course['documents']:
        point = models.PointStruct(
            id=point_id,
            # the text is embedded with EMBEDDING_MODEL when the points are upserted
            vector=models.Document(text=doc['text'], model=EMBEDDING_MODEL),
            payload={
                "text": doc['text'],
                "section": doc['section'],
                "course": course['course']
            }  # save all needed metadata fields
        )
        points.append(point)

        # increment the ID for the next point
        point_id += 1
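A sequential counter depends on iteration order, so the same document can receive a different ID if the file changes. Qdrant also accepts UUIDs as point IDs, so one alternative is to derive the ID from the document's content. The sketch below is one possible approach, not the method used in this notebook, and `make_point_id` is a hypothetical helper.

```python
import uuid


def make_point_id(course, question, text):
    """Derive a stable UUID string from a document's identifying fields."""
    key = "|".join([course, question, text])
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))


# The same input always maps to the same ID; different input maps to a different one.
first = make_point_id("course-a", "Question 1", "Answer 1")
again = make_point_id("course-a", "Question 1", "Answer 1")
other = make_point_id("course-a", "Question 2", "Answer 2")

print(first == again, first == other)
```

With content-derived IDs, re-running the upsert overwrites existing points instead of relying on the loop order staying the same.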

In [27]:
points[:5]

[PointStruct(id=0, vector=Document(text="The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.", model='jinaai/jina-embeddings-v2-small-en', options=None), payload={'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to registe

In [28]:
client.upsert(
    collection_name=collection_name,
    points=points
)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)
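Sending every point in a single call is fine for a small dataset, but larger ones are often split into batches. The sketch below wraps the same `client.upsert(collection_name=..., points=...)` call used above; `batched`, `upsert_in_batches`, and the default batch size of 100 are assumptions made for illustration. A small recording stand-in replaces the real client so the example runs without a Qdrant instance.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def upsert_in_batches(client, collection_name, points, batch_size=100):
    """Call client.upsert once per batch and return the number of calls made."""
    calls = 0
    for batch in batched(points, batch_size):
        client.upsert(collection_name=collection_name, points=batch)
        calls += 1
    return calls


class RecordingClient:
    """Stand-in for QdrantClient that only records batch sizes (demo only)."""

    def __init__(self):
        self.batch_sizes = []

    def upsert(self, collection_name, points):
        self.batch_sizes.append(len(points))


demo_client = RecordingClient()
num_calls = upsert_in_batches(demo_client, "demo", list(range(250)), batch_size=100)
print(num_calls, demo_client.batch_sizes)
```

In the notebook, the call would be `upsert_in_batches(client, collection_name, points)`.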

#### 6. Running a Similarity Search


In [29]:
def search(query, limit=1):

    results = client.query_points(
        collection_name=collection_name,
        query=models.Document(  # embed the query text locally with "jinaai/jina-embeddings-v2-small-en"
            text=query,
            model=EMBEDDING_MODEL  # specify the embedding model to use
        ),
        limit=limit,  # top closest matches
        with_payload=True  # to get metadata in the results
    )

    return results

In [30]:
import random 

course = random.choice(data)
course_piece = random.choice(course['documents'])
print(json.dumps(course_piece, indent=2))

{
  "text": "If you\u2019re using an Anaconda installation:\nCd home/\nConda install gcc\nSource back to your RisingWave Venv - source .venv/bin/activate\nPip install psycopg2-binary\nPip install -r requirements.txt\nFor some reason this worked - the Conda base doesn\u2019t have the GCC installed - (GNU Compiler Collection) a compiler system that supports various programming languages. Without this the it fails to install pyproject.toml-based projects\n\u201cIt's possible that in your specific environment, the gcc installation was required at the system level rather than within the virtual environment. This can happen if the build process for psycopg2 tries to access system-level dependencies during installation.\nInstalling gcc in your main Python installation (Conda) would make it available system-wide, allowing any Python environment to access it when necessary for building packages.\u201d\ngcc stands for GNU Compiler Collection. It is a compiler system developed by the GNU Project 

In [31]:
result = search(course_piece['question'], limit=3)

In [32]:
result

QueryResponse(points=[ScoredPoint(id=112, version=0, score=0.82644975, payload={'text': "Issue:\ne…\nSolution:\npip install psycopg2-binary\nIf you already have it, you might need to update it:\npip install psycopg2-binary --upgrade\nOther methods, if the above fails:\nif you are getting the “ ModuleNotFoundError: No module named 'psycopg2' “ error even after the above installation, then try updating conda using the command conda update -n base -c defaults conda. Or if you are using pip, then try updating it before installing the psycopg packages i.e\nFirst uninstall the psycopg package\nThen update conda or pip\nThen install psycopg again using pip.\nif you are still facing error with r pcycopg2 and showing pg_config not found then you will have to install postgresql. in MAC it is brew install postgresql", 'section': 'Module 1: Docker and Terraform', 'course': 'data-engineering-zoomcamp'}, vector=None, shard_key=None, order_value=None), ScoredPoint(id=423, version=0, score=0.82351214,

In [33]:
print(f"Question:\n{course_piece['question']}\n")
print("Top Retrieved Answer:\n{}\n".format(result.points[0].payload['text']))
print("Original Answer:\n{}".format(course_piece['text']))

Question:
Psycopg2 - `Could not build wheels for psycopg2, which is required to install pyproject.toml-based projects`

Top Retrieved Answer:
Issue:
e…
Solution:
pip install psycopg2-binary
If you already have it, you might need to update it:
pip install psycopg2-binary --upgrade
Other methods, if the above fails:
if you are getting the “ ModuleNotFoundError: No module named 'psycopg2' “ error even after the above installation, then try updating conda using the command conda update -n base -c defaults conda. Or if you are using pip, then try updating it before installing the psycopg packages i.e
First uninstall the psycopg package
Then update conda or pip
Then install psycopg again using pip.
if you are still facing error with r pcycopg2 and showing pg_config not found then you will have to install postgresql. in MAC it is brew install postgresql

Original Answer:
If you’re using an Anaconda installation:
Cd home/
Conda install gcc
Source back to your RisingWave Venv - source .venv/bin

In [34]:
print(search("what if I submit homework late?", limit=3))

points=[ScoredPoint(id=15, version=0, score=0.90654105, payload={'text': 'No, late submissions are not allowed. But if the form is still not closed and it’s after the due date, you can still submit the homework. confirm your submission by the date-timestamp on the Course page.y\nOlder news:[source1] [source2]', 'section': 'General course-related questions', 'course': 'data-engineering-zoomcamp'}, vector=None, shard_key=None, order_value=None), ScoredPoint(id=2, version=0, score=0.87710583, payload={'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.", 'section': 'General course-related questions', 'course': 'data-engineering-zoomcamp'}, vector=None, shard_key=None, order_value=None), ScoredPoint(id=797, version=0, score=0.863876, payload={'text': "Depends on whether the form will still be open. If you're lucky and it's o
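The raw response is hard to scan. Since each returned point exposes `id`, `score`, and `payload`, a small formatting helper makes the ranking easier to read. The `summarize` function below is a hypothetical convenience, demonstrated on a stand-in object with the same attribute names so it runs without a Qdrant instance.

```python
from types import SimpleNamespace


def summarize(results, max_chars=80):
    """Return (id, rounded score, text preview) for each point in a query response."""
    rows = []
    for point in results.points:
        text = point.payload["text"].replace("\n", " ")
        rows.append((point.id, round(point.score, 3), text[:max_chars]))
    return rows


# Stand-in response with invented values, mimicking the attributes used above.
fake_results = SimpleNamespace(points=[
    SimpleNamespace(id=1, score=0.91234, payload={"text": "First\nanswer"}),
    SimpleNamespace(id=2, score=0.5, payload={"text": "Second answer"}),
])

rows = summarize(fake_results)
for row in rows:
    print(row)
```

In the notebook, `summarize(search("what if I submit homework late?", limit=3))` would print one compact row per match.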

#### 7. Running a Similarity Search with Filters

We can refine our search using metadata filters.

To enable efficient filtering, we need to turn on indexing of payload fields.

In [35]:
client.create_payload_index(
    collection_name=collection_name,
    field_name="course",
    field_schema="keyword" # exact matching on string metadata fields
)

UpdateResult(operation_id=2, status=<UpdateStatus.COMPLETED: 'completed'>)
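Conceptually, a keyword index behaves like an inverted index: it maps each distinct field value to the points that carry it, so a filter can look up matching points directly instead of scanning every payload. The sketch below illustrates the idea only and is not a description of Qdrant's internal data structures; `build_keyword_index` and the sample payloads are invented.

```python
from collections import defaultdict


def build_keyword_index(payloads, field):
    """Map each distinct value of `field` to the set of point IDs that carry it."""
    index = defaultdict(set)
    for point_id, payload in payloads.items():
        index[payload[field]].add(point_id)
    return dict(index)


# Invented payloads keyed by point ID.
payloads = {
    0: {"course": "data-engineering-zoomcamp"},
    1: {"course": "mlops-zoomcamp"},
    2: {"course": "data-engineering-zoomcamp"},
}

course_index = build_keyword_index(payloads, "course")
print(sorted(course_index["data-engineering-zoomcamp"]))
```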

Now let's update our search function to filter by course:

In [37]:
def search_in_course(query, course="mlops-zoomcamp", limit=1):

    results = client.query_points(
        collection_name=collection_name,
        query=models.Document(  # embed the query text locally with "jinaai/jina-embeddings-v2-small-en"
            text=query,
            model=EMBEDDING_MODEL  # specify the embedding model to use
        ),
        query_filter=models.Filter(  # filter by course name
            must=[
                models.FieldCondition(
                    key="course",
                    match=models.MatchValue(value=course)
                )
            ]
        ),
        limit=limit,  # top closest matches
        with_payload=True  # to get metadata in the results
    )

    return results

Let’s see how the same question is answered across different courses: data-engineering-zoomcamp, machine-learning-zoomcamp, and mlops-zoomcamp.

In [38]:
print(search_in_course("What if I submit homeworks late?", "mlops-zoomcamp").points[0].payload['text'])

Please choose the closest one to your answer. Also do not post your answer in the course slack channel.
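To compare several courses in one go, the filtered search can be called in a loop. The sketch below takes the search function as a parameter so it can be demonstrated with a stand-in; `top_answer_per_course` and `fake_search` are hypothetical names, and the canned answer is invented.

```python
from types import SimpleNamespace


def top_answer_per_course(search_fn, query, courses):
    """Run `search_fn(query, course)` for every course and collect the top answer text."""
    answers = {}
    for course in courses:
        points = search_fn(query, course).points
        # A filter may match nothing, so guard against an empty result list.
        answers[course] = points[0].payload["text"] if points else None
    return answers


def fake_search(query, course):
    """Stand-in for search_in_course that returns a canned answer for one course only."""
    canned = {"course-a": "Answer for course-a"}
    points = []
    if course in canned:
        points.append(SimpleNamespace(payload={"text": canned[course]}))
    return SimpleNamespace(points=points)


answers = top_answer_per_course(fake_search, "late homework?", ["course-a", "course-b"])
print(answers)
```

In the notebook, passing `search_in_course` and the three course names listed above would return one top answer per course.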
