## Vector Search

Traditional keyword search works by matching exact words. This works well when you know the precise keywords present in the data. But what happens when there are no keywords? What if you're searching through images, audio, video or code, or even cross-modally? Vector search retrieves information based on semantic similarity measured numerically between vectorized data representations (embeddings). It recognizes patterns and relationships between concepts, enabling search systems to retrieve the most relevant content, even when the phrasing differs, terminology varies, or no explicit keywords exist.

> Keyword search is precise, while vector search (thanks to embeddings) is flexible.

While traditional databases primarily serve as data stores, vector databases behave more like search engines. They are designed to be scalable, always available, and capable of delivering fast search results even under heavy load. Much like web search engines such as Google or Bing, which serve billions of queries a day, vector databases target scenarios where rapid, high-throughput, low-latency retrieval is a must.

### Vectors
Vectors (also known as embeddings) are high-dimensional representations of various data points — texts, images, videos, etc. Many state-of-the-art (SOTA) embedding models generate representations of over 1,500 dimensions. When it comes to state-of-the-art PDF retrieval, the representations can reach over 100,000 dimensions per page.

#### Properties of vectors
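- Vectors are heavy: each one is a long array of floating-point numbers (a single 512-dimensional float32 vector takes 2 KB).
- They are derived from some other source-of-truth data: a vector is always a transformation of text, video, images, or audio.
- They are fixed-size: a given embedding model always produces vectors with the same number of dimensions.
- The same embedding model must be used for stored data and queries, so that all vectors share the same vector-space geometry.
- To get the benefits of vector data, it needs to be stored separately, in a system built to index and search it.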

Vector search relies on high-dimensional vector math, which makes it computationally heavy at scale. A brute-force similarity search compares the query against every vector in the database, which is infeasible in production, where a database can hold on the order of 100 million records.
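To make that cost concrete, here is a minimal pure-Python sketch of brute-force (exact) k-nearest-neighbour search with cosine similarity. The function names and the tiny 2D vectors are illustrative assumptions, not part of Qdrant; the point is that every query touches every stored vector, so the work grows with N × d.

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_search(query: list[float], vectors: list[list[float]], k: int = 3) -> list[tuple[float, int]]:
    """Exact kNN: score the query against EVERY stored vector and keep the k best.

    Cost per query is O(N * d) for N stored vectors of dimension d.
    """
    scored = ((cosine_similarity(query, vec), idx) for idx, vec in enumerate(vectors))
    return heapq.nlargest(k, scored)  # [(score, index), ...], best first


stored = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
top = brute_force_search([1.0, 0.1], stored, k=2)
print([idx for _, idx in top])  # [0, 1]
```

With 100 million 512-dimensional vectors, this loop would perform roughly 5 × 10^10 multiply-adds per query, which is why dedicated engines rely on an approximate index instead.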

Retrieval-Augmented Generation (RAG) and agentic RAG use vector databases as a knowledge source to retrieve context for large language models (LLMs). In the retrieval step, vector search lets us refine our choices based on similarity and dissimilarity rather than starting from a fixed query. This flexibility is possible because vector search isn't tied to the binary “match/not match” concept: it operates on distances in a vector space.
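The difference between a binary match and a graded distance can be sketched in a few lines. The keyword matcher below is a deliberately naive assumption, and the three-dimensional "embeddings" are hand-made toy values, not output from any real model.

```python
import math


def keyword_match(query: str, doc: str) -> bool:
    """Binary keyword match: True only if every query word literally appears in the doc."""
    doc_words = set(doc.lower().split())
    return all(word in doc_words for word in query.lower().split())


# Hand-made toy "embeddings" (hypothetical values, not produced by any real model)
query_text, query_vec = "late homework", [0.9, 0.1, 0.3]
docs = {
    "submitting assignments after the deadline": [0.8, 0.2, 0.4],
    "installing docker on windows": [0.1, 0.9, 0.0],
}

for text, vec in docs.items():
    # math.dist is the Euclidean distance between the two points
    print(f"{text!r}: keyword match={keyword_match(query_text, text)}, "
          f"distance={math.dist(query_vec, vec):.3f}")
```

Both documents fail the keyword test, but the first one is much closer to the query (0.173 vs 1.170), so a vector search would rank it first.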

### Qdrant
Qdrant is an open-source vector search engine: a dedicated solution, built in Rust, for scalable vector search. Using a dedicated engine lets us:

- Run production-level vector search at scale;
- Stay in sync with the latest trends and best practices;
- Fully use vector search capabilities (including those beyond simple similarity search).

#### Installing Qdrant

```bash
docker pull qdrant/qdrant

docker run -p 6333:6333 -p 6334:6334 \
   -v "$(pwd)/qdrant_storage:/qdrant/storage:z" \
   qdrant/qdrant
```

The `-v` option on the second line of the `docker run` command mounts the local `qdrant_storage` directory into the container, so your data persists even if you restart or delete the container.

- 6333 – REST API port
- 6334 – gRPC API port

Qdrant provides a built-in Web UI that you can use to inspect collections, check system health, and even run simple queries.
When you run Qdrant in Docker, the Web UI is available at http://localhost:6333/dashboard.

#### Installing Required Libraries
In the environment created specifically for this course, we’ll install:

- The `qdrant-client` package. Qdrant offers official clients for Python, JavaScript/TypeScript, Go, and Rust, among others.
- The `fastembed` package, an optimized embedding (data vectorization) solution designed specifically for Qdrant. Make sure `qdrant-client` is version `>= 1.14.2` to use local inference with Qdrant.

### References
- [Qdrant: dedicated vector search](https://qdrant.tech/articles/dedicated-vector-search/)
- [LLM Zoomcamp semantic search notebook](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/02-vector-search/sematic_search.ipynb)

```python
!python -m pip install -q "qdrant-client[fastembed]>=1.14.2"
```



If you're running this in GitHub Codespaces, port 6333 is forwarded to a URL such as https://super-duper-doodle-55v9gv4wrv92v764-6333.app.github.dev (mine); append `/dashboard` to the forwarded URL to open the Web UI.

```python
# Import required libraries
from qdrant_client import QdrantClient, models
import requests
```

```python
# Initialise the client and connect to our local instance
client = QdrantClient("http://localhost:6333")
```

### Step 2: Study the Dataset
To build a working vector search solution (and, more generally, to understand if/when/how it’s needed), it's good to study the dataset and figure out the nature and structure of the data we’re working with, for example:

- modality: is it text, images, videos, or a combination?
- specifics: if it's text, which language is used, how long the text pieces are, whether there are any special characters, etc.

It will help us define:

- the right data "schema" (what to vectorize, what to store as metadata, etc);
- the right embedding model (the best fit based on the domain, precision & resource requirements).

We have a toy dataset provided for experimentation, let's check it out:

```python
docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()
```

```python
documents_raw
```

Data already seems cleaned and chunked (i.e., divided into small pieces that embedding models can easily digest), so what's left is to define:

- which fields could be used for semantic search;
- which fields should be stored as metadata, e.g., usable in filtering conditions.

We have a dataset covering three courses:
`data-engineering-zoomcamp`, `machine-learning-zoomcamp`, and `mlops-zoomcamp`.
Each course includes a collection of `question` and `text` (answer) pairs, along with the `section` the question belongs to.

Based on this, we will store the `course` and `section` fields as metadata.
This way, we can filter search results when asking questions related to a specific course or section.
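The split between "what gets embedded" and "what becomes metadata" can be sketched as a small generator. The `sample` record below is hypothetical, shaped after the `documents.json` structure used in this notebook (a list of courses, each with `documents` holding `text`, `section`, and `question`); `split_fields` is our own helper name.

```python
from collections.abc import Iterator

# Hypothetical sample that mirrors the structure of documents.json:
# a list of courses, each with a list of {text, section, question} records.
sample = [
    {
        "course": "mlops-zoomcamp",
        "documents": [
            {
                "text": "You can still submit while the homework form is open.",
                "section": "General course-related questions",
                "question": "Can I submit homework late?",
            }
        ],
    }
]


def split_fields(raw: list[dict]) -> Iterator[tuple[str, dict]]:
    """Yield (text_to_embed, metadata) pairs.

    The answer text is what gets vectorized; course and section become
    payload metadata that can be used in filters.
    """
    for course in raw:
        for doc in course["documents"]:
            yield doc["text"], {"course": course["course"], "section": doc["section"]}


for text, metadata in split_fields(sample):
    print(metadata, "->", text)
```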

### Step 3: Choosing the Embedding Model with FastEmbed
Now that we know we're embedding small chunks of English text (course-related question and answer pairs), we can choose a suitable embedding model to convert this data into vectors.

The choice of an embedding model depends on many factors, including:

- The task, data modality, and data specifics;
- The trade-off between search precision and resource usage (larger embeddings require more storage and memory);
- The cost of inference (especially if you're using a third-party provider).
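The storage side of this trade-off is simple arithmetic: N vectors × d dimensions × bytes per value. The sketch below assumes 4-byte float32 components and ignores index, payload, and quantization overhead; the 948-point figure is the size of our dataset, while the 100-million-vector, 1,536-dimension collection is a hypothetical.

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw size of dense vectors, assuming float32 (4 bytes per value).

    Excludes the vector index, payloads, and any quantization.
    """
    return num_vectors * dim * bytes_per_value


# Our toy dataset: 948 answers embedded into 512 dimensions
print(raw_vector_bytes(948, 512))            # 1941504 bytes, about 1.9 MB
# A hypothetical production collection: 100M vectors of 1,536 dimensions
print(raw_vector_bytes(100_000_000, 1536))   # 614400000000 bytes, about 614 GB
```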

---

[FastEmbed](https://github.com/qdrant/fastembed) is an optimized embedding solution designed specifically for Qdrant. It delivers low-latency, CPU-friendly embedding generation, eliminating the need for heavy frameworks like PyTorch or TensorFlow. It uses quantized model weights and ONNX Runtime, making it significantly faster than traditional Sentence Transformers on CPU while maintaining competitive accuracy.

FastEmbed supports:

- Dense embeddings for text and images (the most common type in vector search, and the one we'll use here)
- Sparse embeddings (e.g., BM25 and sparse neural embeddings)
- Multivector embeddings (e.g., ColPali and ColBERT, late interaction models)
- Rerankers

All of these can be directly used in Qdrant (as Qdrant supports dense, sparse & multivectors along with hybrid search).
FastEmbed’s integration with Qdrant allows you to directly pass text or images to the Qdrant client for embedding.

### FastEmbed for Textual Data

```python
from fastembed import TextEmbedding
TextEmbedding.list_supported_models()
```

```python
import json

from fastembed import TextEmbedding

# We need an embedding model suitable for English text.
# A unimodal (text-only) model makes sense, since we're not including images in our search.
# We also want small-to-moderate embeddings (e.g., 512 dimensions),
# so we don't overuse resources in our simple setup.
EMBEDDING_DIMENSIONALITY = 512

for model in TextEmbedding.list_supported_models():
    if model["dim"] == EMBEDDING_DIMENSIONALITY:
        print(json.dumps(model, indent=2))
```

```python
# jina-embeddings-v2-small-en produces 512-dimensional English text embeddings,
# which matches our requirements
model_handle = "jinaai/jina-embeddings-v2-small-en"
```

Now we’re ready to configure and use Qdrant for semantic search. To fully understand what’s happening, here’s a quick overview of Qdrant’s core terminology:

- Points are the central entity Qdrant works with. A point is a record consisting of an ID, a vector, and an optional payload.
- A collection is a named set of points (i.e., vectors with optional payloads) that you can search within. Think of it as the container for your vector search solution: a single business problem solved. A collection is roughly analogous to an index in Elasticsearch.

Qdrant supports different types of vectors to enable different modes of data exploration and search (dense, sparse, multivectors, and named vectors). In this example, we’ll use the most common type, dense vectors.

> Embeddings capture the semantic essence of the data, while the payload holds structured metadata.
> This metadata becomes especially useful when applying filters or sorting during search. Qdrant's payloads can hold structured data like booleans, keywords, geo-locations, arrays, and nested objects.
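To make the terminology concrete, here is a toy in-memory model of points and collections. The `Point` and `Collection` classes are hypothetical illustrations, not Qdrant's API (the real client uses `models.PointStruct` and `client.create_collection`); they only show that a point bundles an ID, a vector, and a payload, and that all vectors in a collection share one size.

```python
from dataclasses import dataclass, field
from typing import Any, Union


@dataclass
class Point:
    """Toy model of a point: an ID, a dense vector, and an optional payload."""
    id: Union[int, str]
    vector: list[float]
    payload: dict[str, Any] = field(default_factory=dict)


@dataclass
class Collection:
    """Toy model of a collection: a named set of points that share one vector size."""
    name: str
    size: int
    points: dict[Union[int, str], Point] = field(default_factory=dict)

    def upsert(self, point: Point) -> None:
        # Every vector in a collection must have the configured dimensionality
        if len(point.vector) != self.size:
            raise ValueError(f"expected {self.size} dimensions, got {len(point.vector)}")
        # Upsert semantics: a new ID inserts, an existing ID is overwritten
        self.points[point.id] = point


demo = Collection(name="demo", size=3)
demo.upsert(Point(id=1, vector=[0.1, 0.2, 0.3], payload={"course": "mlops-zoomcamp"}))
demo.upsert(Point(id=1, vector=[0.3, 0.2, 0.1]))  # same ID: replaces the first point
print(len(demo.points), demo.points[1].vector)  # 1 [0.3, 0.2, 0.1]
```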

### Step 4: Create a Collection
When creating a collection, we need to specify:

- Name: A unique identifier for the collection.
- Vector Configuration:
  - Size: The dimensionality of the vectors.
  - Distance Metric: The method used to measure similarity between vectors.
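Qdrant's distance options include Cosine, Dot, Euclid, and Manhattan. The plain-Python versions below are a sketch of what each metric computes, not Qdrant's implementation. The example pair shows why cosine suits text embeddings: two vectors pointing in the same direction score 1 regardless of length, while the dot product and the distance metrics also react to magnitude.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Dot product: larger means more similar; sensitive to vector length."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product of the unit-length vectors, in [-1, 1]."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclid(a: list[float], b: list[float]) -> float:
    """Euclidean (L2) distance: smaller means closer."""
    return math.dist(a, b)


def manhattan(a: list[float], b: list[float]) -> float:
    """Manhattan (L1) distance: smaller means closer."""
    return sum(abs(x - y) for x, y in zip(a, b))


a, b = [1.0, 2.0], [2.0, 4.0]  # b points the same way as a but is twice as long
print(dot(a, b), cosine(a, b), euclid(a, b), manhattan(a, b))
```

Here the dot product is 10, cosine is 1 (up to floating-point rounding), and the Euclidean and Manhattan distances are √5 and 3, even though both vectors point in the same direction.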

There are additional parameters you can explore in the [Qdrant documentation](https://qdrant.tech/documentation/concepts/collections/#create-a-collection). You can also configure other vector types in Qdrant beyond typical dense embeddings (e.g., for hybrid search). For this example, however, the simplest default configuration is sufficient.

```python
# Define the collection name
collection_name = "zoomcamp-rag"

# Create the collection only if it doesn't exist yet
if not client.collection_exists(collection_name):
    client.create_collection(
        collection_name=collection_name,
        vectors_config=models.VectorParams(
            size=EMBEDDING_DIMENSIONALITY,  # dimensionality of the vectors
            distance=models.Distance.COSINE,  # distance metric for similarity search
        ),
    )
    print(f"Collection '{collection_name}' created successfully")
else:
    print(f"Collection '{collection_name}' already exists")
```

Collection 'zoomcamp-rag' already exists


### Step 5: Create, Embed & Insert Points into the Collection

[Points](https://qdrant.tech/documentation/concepts/points/#points) are the core data entities in Qdrant. Each point consists of:

- ID. A unique identifier. Qdrant supports both 64-bit unsigned integers and UUIDs.
- Vector. The embedding that represents the data point in vector space.
- Payload (optional). Additional metadata as key-value pairs.
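The code in this step uses sequential integer IDs, which works as long as documents are always ingested in the same order. An alternative, sketched here as an assumption rather than something this course does, is to derive a UUID from the content with Python's `uuid.uuid5`, so the same record always maps to the same ID. The namespace string and the `point_id_for` helper are hypothetical.

```python
import uuid

# Hypothetical namespace for this project; any fixed UUID works
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "zoomcamp-rag.example")


def point_id_for(course: str, text: str) -> str:
    """Deterministic UUID for a (course, answer text) pair.

    Re-running ingestion yields the same IDs, so an upsert overwrites
    existing points instead of creating duplicates.
    """
    return str(uuid.uuid5(NAMESPACE, f"{course}\n{text}"))


first = point_id_for("mlops-zoomcamp", "Please choose the closest one to your answer.")
again = point_id_for("mlops-zoomcamp", "Please choose the closest one to your answer.")
print(first == again)  # True
```

Including the course in the hashed string keeps identical answers from different courses as separate points.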

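```python
points = []
point_id = 0  # sequential integer IDs (named to avoid shadowing the built-in id())

for course in documents_raw:
    for doc in course['documents']:
        point = models.PointStruct(
            id=point_id,
            # The text is embedded locally with model_handle when the point is upserted
            vector=models.Document(text=doc['text'], model=model_handle),
            payload={
                "text": doc['text'],
                "section": doc['section'],
                "course": course['course'],
            },  # save all needed metadata fields
        )
        points.append(point)
        point_id += 1
```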

```python
# Embed and upload points to our collection.
# First, FastEmbed downloads the selected model (by default into
# os.path.join(tempfile.gettempdir(), "fastembed_cache")) and runs inference on your machine.
# Then the generated points are upserted into the collection, and the vector index is built.
client.upsert(
    collection_name=collection_name,
    points=points
)
```


UpdateResult(operation_id=1, status=<UpdateStatus.COMPLETED: 'completed'>)

In addition to basic upsert, Qdrant supports batch upsert in both column- and record-oriented formats.

The Python client also offers:
- Parallelization
- Retries
- Lazy batching

These can be configured via parameters of the `upload_collection` and `upload_points` methods.
For details, check the [documentation](https://qdrant.tech/documentation/concepts/points/#upload-points).
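To illustrate the idea of lazy batching (not the client's internal implementation), here is a small generator that splits any iterable into fixed-size chunks. With the Python client you would typically let `upload_points` handle this for you; the helper name `batched` is our own, and Python 3.12+ ships a similar `itertools.batched` that yields tuples.

```python
from collections.abc import Iterable, Iterator
from itertools import islice
from typing import TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Lazily yield lists of up to batch_size items.

    Only one batch is held in memory at a time, so the input can be a
    generator that reads or embeds records on the fly.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


print(list(batched(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```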

### Study Data Visually
Let’s explore the uploaded data in the Qdrant Web UI at http://localhost:6333/dashboard to study semantic similarity visually.

For example, using the **Visualize** tab of the `zoomcamp-rag` collection, we can view all answers to the course questions (948 points) and see how they group together by meaning, additionally coloured by course.

To do that, run the following request in the Visualize tab:

```json
{
  "limit": 948,
  "color_by": {
    "payload": "course"
  }
}
```

This 2D view is the result of dimensionality reduction applied to the 512-dimensional jina-embeddings vectors.

### Step 6: Running a Similarity Search
#### How Similarity Search Works in Qdrant

- Qdrant compares the query vector to stored vectors (based on a vector index) using the distance metric defined when creating the collection.
- The closest matches are returned, ranked by similarity.

The vector index (HNSW in Qdrant) is built for approximate nearest neighbor (ANN) search, which trades a small amount of accuracy for speed and makes large-scale vector search feasible.

```python
def search(query, limit=1):

    results = client.query_points(
        collection_name=collection_name,
        query=models.Document(
            text=query,
            model=model_handle
        ),
        limit=limit,  # top closest matches
        with_payload=True  # to get metadata in the results
    )

    return results
```

```python
# Pick a random question to ask. NB: we uploaded only the answers, not the questions.
import json
import random

course = random.choice(documents_raw)
course_piece = random.choice(course['documents'])
print(json.dumps(course_piece, indent=2))
```

```json
{
  "text": "We can use sklearn & numpy packages to calculate Root Mean Squared Error\nfrom sklearn.metrics import mean_squared_error\nimport numpy as np\nRmse = np.sqrt(mean_squared_error(y_pred, y_val/ytest)\nAdded by Radikal Lukafiardi\nYou can also refer to Alexey\u2019s notebook for Week 2:\nhttps://github.com/alexeygrigorev/mlbookcamp-code/blob/master/chapter-02-car-price/02-carprice.ipynb\nwhich includes the following code:\ndef rmse(y, y_pred):\nerror = y_pred - y\nmse = (error ** 2).mean()\nreturn np.sqrt(mse)\n(added by Rileen Sinha)",
  "section": "3. Machine Learning for Classification",
  "question": "How to calculate Root Mean Squared Error?"
}
```


```python
result = search(course_piece['question'])
result
```


QueryResponse(points=[ScoredPoint(id=552, version=1, score=0.8832736, payload={'text': 'We can use sklearn & numpy packages to calculate Root Mean Squared Error\nfrom sklearn.metrics import mean_squared_error\nimport numpy as np\nRmse = np.sqrt(mean_squared_error(y_pred, y_val/ytest)\nAdded by Radikal Lukafiardi\nYou can also refer to Alexey’s notebook for Week 2:\nhttps://github.com/alexeygrigorev/mlbookcamp-code/blob/master/chapter-02-car-price/02-carprice.ipynb\nwhich includes the following code:\ndef rmse(y, y_pred):\nerror = y_pred - y\nmse = (error ** 2).mean()\nreturn np.sqrt(mse)\n(added by Rileen Sinha)', 'section': '3. Machine Learning for Classification', 'course': 'machine-learning-zoomcamp'}, vector=None, shard_key=None, order_value=None)])

`score` is the cosine similarity between the query (`question`) embedding and the embedding of the retrieved `text`.
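According to Qdrant's documentation, the Cosine metric is implemented as a dot product over normalized vectors, with vectors normalized automatically on upload. The sketch below checks that identity on two made-up 2D vectors. Cosine scores range from -1 to 1, with 1 meaning the vectors point in exactly the same direction.

```python
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


query, stored = [3.0, 4.0], [4.0, 3.0]

# Cosine similarity computed directly...
cosine = dot(query, stored) / (math.sqrt(dot(query, query)) * math.sqrt(dot(stored, stored)))
# ...equals the plain dot product of the normalized vectors
dot_of_normalized = dot(normalize(query), normalize(stored))
print(round(cosine, 6), round(dot_of_normalized, 6))  # 0.96 0.96
```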

Let’s compare the original and retrieved answers for our randomly selected question.

```python
print(f"Question:\n{course_piece['question']}\n")
print("Top Retrieved Answer:\n{}\n".format(result.points[0].payload['text']))
print("Original Answer:\n{}".format(course_piece['text']))
```

```text
Question:
How to calculate Root Mean Squared Error?

Top Retrieved Answer:
We can use sklearn & numpy packages to calculate Root Mean Squared Error
from sklearn.metrics import mean_squared_error
import numpy as np
Rmse = np.sqrt(mean_squared_error(y_pred, y_val/ytest)
Added by Radikal Lukafiardi
You can also refer to Alexey’s notebook for Week 2:
https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/chapter-02-car-price/02-carprice.ipynb
which includes the following code:
def rmse(y, y_pred):
error = y_pred - y
mse = (error ** 2).mean()
return np.sqrt(mse)
(added by Rileen Sinha)

Original Answer:
We can use sklearn & numpy packages to calculate Root Mean Squared Error
from sklearn.metrics import mean_squared_error
import numpy as np
Rmse = np.sqrt(mean_squared_error(y_pred, y_val/ytest)
Added by Radikal Lukafiardi
You can also refer to Alexey’s notebook for Week 2:
https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/chapter-02-car-price/02-carprice.ipynb
which includ
```

```python
print(search("What if I submit homeworks late?").points[0].payload['text'])
```

No, late submissions are not allowed. But if the form is still not closed and it’s after the due date, you can still submit the homework. confirm your submission by the date-timestamp on the Course page.y
Older news:[source1] [source2]


### Step 7: Running a Similarity Search with Filters
We can refine our search using metadata filters. This is similar to our toy search engine, where we could filter by course.

> Qdrant’s custom vector index implementation, Filterable HNSW, allows for precise and scalable vector search with filtering conditions.

For example, we can search for an answer to a question related to a specific course from the three available in the dataset.
Using a `must` filter ensures that all specified conditions are met for a data point to be included in the search results.

> Qdrant also supports other filter types such as `should`, `must_not`, `range`, and more. For a full overview, check the [Filtering Guide](https://qdrant.tech/articles/vector-search-filtering/)
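The semantics of these clauses can be sketched in plain Python. This toy `matches` function is an assumption-based illustration, not Qdrant's filter engine: it handles only exact-match conditions, and it assumes all clauses combine with AND, with `should` requiring at least one of its conditions to hold whenever any are given. The payload values are hypothetical.

```python
from collections.abc import Sequence
from typing import Any

Condition = tuple[str, Any]  # hypothetical: (payload key, exact value to match)


def matches(
    payload: dict[str, Any],
    must: Sequence[Condition] = (),
    should: Sequence[Condition] = (),
    must_not: Sequence[Condition] = (),
) -> bool:
    """Toy evaluation of must / should / must_not over exact-match conditions."""

    def holds(condition: Condition) -> bool:
        key, value = condition
        return payload.get(key) == value

    return (
        all(holds(c) for c in must)  # every must condition holds
        and not any(holds(c) for c in must_not)  # no must_not condition holds
        and (not should or any(holds(c) for c in should))  # at least one should, if given
    )


point_payload = {"course": "mlops-zoomcamp", "section": "Module 1: Introduction"}
print(matches(point_payload, must=[("course", "mlops-zoomcamp")]))  # True
print(matches(point_payload, must_not=[("course", "mlops-zoomcamp")]))  # False
```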

To enable efficient filtering, we need to turn on [indexing of payload fields](https://qdrant.tech/documentation/concepts/indexing/#payload-index).

```python
client.create_payload_index(
    collection_name=collection_name,
    field_name="course",
    field_schema="keyword"  # exact matching on string metadata fields
)
```

UpdateResult(operation_id=3, status=<UpdateStatus.COMPLETED: 'completed'>)

```python
# Update our search function to allow filtering
def search_in_course(query, course="mlops-zoomcamp", limit=1):

    results = client.query_points(
        collection_name=collection_name,
        query=models.Document(
            text=query,
            model=model_handle
        ),
        query_filter=models.Filter(  # this allows us to filter by course name
            must=[
                models.FieldCondition(
                    key="course",
                    match=models.MatchValue(value=course)
                )
            ]
        ),
        limit=limit,  # top closest matches
        with_payload=True  # to get metadata in the results
    )

    return results
```

```python
courses = ["data-engineering-zoomcamp", "machine-learning-zoomcamp", "mlops-zoomcamp"]

# Let's see how the same question is answered across different courses
for c in courses:
    print(f"Results from {c}:")
    print(search_in_course("What if I submit homeworks late?", c).points[0].payload['text'] + "\n")
```

Results from data-engineering-zoomcamp:
No, late submissions are not allowed. But if the form is still not closed and it’s after the due date, you can still submit the homework. confirm your submission by the date-timestamp on the Course page.y
Older news:[source1] [source2]

Results from machine-learning-zoomcamp:
Depends on whether the form will still be open. If you're lucky and it's open, you can submit your homework and it will be evaluated. if closed - it's too late.
(Added by Rileen Sinha, based on answer by Alexey on Slack)

Results from mlops-zoomcamp:
Please choose the closest one to your answer. Also do not post your answer in the course slack channel.



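The `mlops-zoomcamp` course doesn't seem to have a relevant answer: the top result is about something else. Vector search always returns the nearest points it finds, even when none of them is truly relevant.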

### Further References

- [Vector Search Manual](https://qdrant.tech/articles/vector-search-manuals/): To dive deeper into efficient vector search setup (data prep, organization, and storage in a production-ready vector search solution)
- [Hybrid Search](https://qdrant.tech/articles/hybrid-search/): combines the strengths of keyword-based search and vector search