## Module 2: Vector Searching

Vector search leverages machine learning to capture the meaning and context of unstructured data, including text and images, transforming it into a numeric representation. Frequently used for semantic search, vector search finds similar data using approximate nearest neighbor algorithms.

Vector search engines (also called vector databases, and often associated with semantic or cosine search) find the nearest neighbors of a given vectorized query. Where traditional search relies on keyword mentions, lexical similarity, and word frequency, vector search engines use distances in the embedding space to represent similarity. Finding related data then becomes a search for the nearest neighbors of the query.

### Qdrant

Qdrant is an open-source vector search engine, a dedicated solution built in Rust for scalable vector search.

### Install dependencies

In [7]:
from qdrant_client import QdrantClient, models
import requests
from fastembed import TextEmbedding
import json
import random


In [2]:
qdrant = QdrantClient("http://localhost:6333")

qdrant

<qdrant_client.qdrant_client.QdrantClient at 0x11913aea0>

### Import the dataset

The dataset has three course types:
- data-engineering-zoomcamp
- machine-learning-zoomcamp
- mlops-zoomcamp

Each course includes a collection of documents, each with:
- *question:* the student's question
- *text:* the answer to the student's question
- *section:* the course section

In [None]:
docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'

response = requests.get(docs_url)
documents_raw = response.json()

documents_raw
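
It can help to inspect the structure of `documents_raw` before embedding anything. The sketch below uses a tiny hand-written sample in the shape the later cells rely on (a list of courses, each with a `course` name and a `documents` list). The sample entries, the section names, and the helper name `count_documents_per_course` are made up for illustration and are not taken from the real file.

```python
from collections import Counter

# Hypothetical sample mirroring the shape of documents_raw:
# a list of courses, each holding a list of question/answer documents.
sample_documents_raw = [
    {
        "course": "data-engineering-zoomcamp",
        "documents": [
            {"question": "Q1", "text": "A1", "section": "General"},
            {"question": "Q2", "text": "A2", "section": "Module 1"},
        ],
    },
    {
        "course": "mlops-zoomcamp",
        "documents": [
            {"question": "Q3", "text": "A3", "section": "General"},
        ],
    },
]


def count_documents_per_course(courses):
    """Return a Counter mapping course name -> number of documents."""
    counts = Counter()
    for course in courses:
        counts[course["course"]] += len(course["documents"])
    return counts


counts = count_documents_per_course(sample_documents_raw)
print(dict(counts))
```

Passing the real `documents_raw` to the same helper would give the number of documents per course.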

### Transform documents into embeddings

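As we are building a Q&A RAG system, the answer stored in the `text` field is converted to an embedding, so that the most relevant answer can be found for a given question. The `question` field could be embedded as well, but the code in this notebook embeds only `text`.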

The `course` and `section` fields can be stored as metadata to provide more context when someone wants to ask a question about a specific course or course section.

#### How to choose the embedding model?

It depends on many factors:
- the task, data modality and specifications
- the trade-off between search precision and resource usage (larger embeddings require more storage and memory)
- the cost of inference when using a third-party provider
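
To make the storage side of this trade-off concrete, the following back-of-the-envelope sketch estimates the raw size of the vectors alone. It assumes each dimension is stored as a 4-byte float32 and ignores index structures, payloads, and any compression. The collection size of 1,000 documents is a hypothetical figure, not the size of the dataset used here.

```python
BYTES_PER_FLOAT32 = 4  # assumption: one float32 per dimension


def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=BYTES_PER_FLOAT32):
    """Estimate raw storage for the vectors only (no index, no payload)."""
    return num_vectors * dimensions * bytes_per_value


num_docs = 1_000  # hypothetical collection size
for dim in (512, 768, 1536):
    size_mb = raw_vector_bytes(num_docs, dim) / 1_000_000
    print(f"{dim} dimensions: {size_mb:.2f} MB")
```

Under these assumptions the raw size grows linearly with the dimensionality, so tripling the dimensions triples the storage for the vectors.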

FastEmbed is an optimized embedding solution designed for Qdrant. It supports:
- dense embeddings for text and images (the most common type in vector search)
- sparse embeddings
- multivector embeddings
- rerankers

FastEmbed's integration with Qdrant lets you pass text or images directly to the Qdrant client for embedding.

#### Find the most suitable embedding model

- Use a small embedding model (e.g. 512 dimensions) that is suitable for English text.
- A unimodal model is enough, since the search covers only text, not images.

In [None]:
EMBEDDING_DIMENSIONALITY = 512

for model in TextEmbedding.list_supported_models():
    if model["dim"] == EMBEDDING_DIMENSIONALITY:
        print(json.dumps(model, indent = 2))

embedding_model = "jinaai/jina-embeddings-v2-small-en"

### Create a Qdrant Collection

A *Collection* in Qdrant is a container for a set of data points that belong to a specific domain or entity we want to search. It has:
- **Name:** A unique identifier for the collection.
- **Vector size:** The dimensionality of the vectors.
- **Distance metric:** The method used to measure similarity between vectors.
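
To illustrate what a distance metric measures, here is a plain-Python sketch of three common ones: cosine similarity, dot product, and Euclidean distance. It is only meant to show how they differ on toy vectors, not how Qdrant computes them internally. The function names and example vectors are chosen for illustration.

```python
import math


def dot(a, b):
    """Dot product: sensitive to both direction and magnitude."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: depends on direction only."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    """Straight-line distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


v = [1.0, 2.0]
same_direction = [2.0, 4.0]  # v scaled by 2
orthogonal = [-2.0, 1.0]     # at a right angle to v

for name, other in (("same_direction", same_direction), ("orthogonal", orthogonal)):
    print(
        name,
        "cosine:", cosine_similarity(v, other),
        "dot:", dot(v, other),
        "euclidean:", euclidean_distance(v, other),
    )
```

Cosine similarity ignores vector length, so `v` and its scaled copy count as pointing the same way, while the dot product and the Euclidean distance are both affected by magnitude.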


In [None]:
collection_name = "zoomcamp-rag"

qdrant.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=EMBEDDING_DIMENSIONALITY,
        distance=models.Distance.COSINE
    )
)

### Create, Embed and Insert a Point into the Collection

A *Point* is the main entity of the Qdrant platform. It represents a data point in a high-dimensional space, enriched with:
- **Identifier:** A unique identifier for the data point.
- **Vector** (embedding): Captures the semantic essence of the data.
- **Payload:** A JSON object with additional metadata. This metadata is especially useful for filtering or sorting during search.
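
The three parts of a point can be pictured as a small record. The dataclass below is a hypothetical stand-in written for illustration, not a class from `qdrant_client` (the notebook itself uses `models.PointStruct`). It also checks that the vector length matches an assumed collection vector size of 4, chosen small for readability.

```python
from dataclasses import dataclass, field
import json


@dataclass
class SketchPoint:
    """Hypothetical illustration of a point: id + vector + payload."""
    id: int
    vector: list
    payload: dict = field(default_factory=dict)

    def validate(self, expected_size):
        # The vector length must match the collection's configured size.
        if len(self.vector) != expected_size:
            raise ValueError(
                f"vector has {len(self.vector)} dimensions, expected {expected_size}"
            )
        # The payload should be JSON-serializable; this raises TypeError if not.
        json.dumps(self.payload)
        return True


point = SketchPoint(
    id=0,
    vector=[0.1, 0.2, 0.3, 0.4],
    payload={"course": "mlops-zoomcamp", "section": "General"},
)
print(point.validate(expected_size=4))
```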

Before uploading the Points to the Collection, each document is embedded with the selected model (`jinaai/jina-embeddings-v2-small-en`) using the FastEmbed library. Then the generated points are inserted into the Collection.

In [12]:
points = []
index = 0


for course in documents_raw:
    for doc in course["documents"]:
        point = models.PointStruct(
            id=index,
            vector=models.Document(text=doc["text"], model=embedding_model),
            payload={
                "text": doc["text"],
                "section": doc["section"],
                "course": course["course"]
            }
        )
        points.append(point)
        index += 1

qdrant.upsert(
    collection_name=collection_name,
    points=points
)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)
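
The cell above sends every point in a single `upsert` call. For larger datasets it may be preferable to send points in smaller batches. The helper below is a generic sketch of that idea: `batched` and the batch sizes are illustrative choices, and the commented-out loop only indicates where the notebook's `qdrant.upsert` call would go.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in for the list of points built above.
fake_points = list(range(8))

batches = list(batched(fake_points, batch_size=3))
print(batches)

# With the real objects from the notebook, the loop would look like:
# for batch in batched(points, batch_size=100):
#     qdrant.upsert(collection_name=collection_name, points=batch)
```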

### Perform Semantic Search - Similarity Search

1. Qdrant compares the query's vector to stored vectors, using the distance metric defined previously. 
2. The closest matches are returned, ranked by similarity. 
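
These two steps can be sketched with a brute-force search over a handful of toy vectors. This is a conceptual illustration only: a vector search engine typically relies on an approximate nearest neighbor index rather than comparing the query with every stored vector. The three-dimensional vectors, the ids, and the helper name `brute_force_search` are invented for the example.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_search(query_vector, stored, limit=1):
    """Score every stored vector against the query and return the top `limit`."""
    # Step 1: compare the query vector with each stored vector.
    scored = [
        (point_id, cosine_similarity(query_vector, vector))
        for point_id, vector in stored.items()
    ]
    # Step 2: rank by similarity, highest first, and keep the best matches.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Toy "collection": point id -> vector (invented values).
stored_vectors = {
    0: [1.0, 0.0, 0.0],
    1: [0.9, 0.1, 0.0],
    2: [0.0, 1.0, 0.0],
}

results = brute_force_search([1.0, 0.05, 0.0], stored_vectors, limit=2)
print([point_id for point_id, _ in results])
```

The `limit` argument plays the same role as the `limit` parameter of the `search` function defined in the next cell.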



In [6]:
def search(query, limit=1):
    return qdrant.query_points(
        collection_name=collection_name,
        query=models.Document(
            text=query,
            model=embedding_model
        ),
        limit=limit,
        with_payload=True
    )

In [12]:
course = random.choice(documents_raw)
course_doc = random.choice(course["documents"])
random_question = course_doc["question"]

random_question

"Py4JJavaError - ModuleNotFoundError: No module named 'py4j'` while executing `import pyspark`"

In [18]:
search_result = search(query=random_question)

# Compare the top retrieved answer with the answer stored for this question.
print(f"Question:\n{course_doc['question']}\n")
print(f"Top Retrieved Answer:\n{search_result.points[0].payload['text']}\n")
print(f"Original Answer:\n{course_doc['text']}")

Question:
Py4JJavaError - ModuleNotFoundError: No module named 'py4j'` while executing `import pyspark`

Top Retrieved Answer:
You need to look for the Py4J file and note the version of the filename. Once you know the version, you can update the export command accordingly, this is how you check yours:
` ls ${SPARK_HOME}/python/lib/ ` and then you add it in the export command, mine was:
export PYTHONPATH=”${SPARK_HOME}/python/lib/Py4J-0.10.9.5-src.zip:${PYTHONPATH}”
Make sure that the version under `${SPARK_HOME}/python/lib/` matches the filename of py4j or you will encounter `ModuleNotFoundError: No module named 'py4j'` while executing `import pyspark`.
For instance, if the file under `${SPARK_HOME}/python/lib/` was `py4j-0.10.9.3-src.zip`.
Then the export PYTHONPATH statement above should be changed to `export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.3-src.zip:$PYTHONPATH"` appropriately.
Additionally, you can check for the version of ‘py4j’ of the spark you’re using from here

In [20]:
search_results = search(query="What if I submit homeworks late?")
answer = search_results.points[0].payload["text"]

answer

'No, late submissions are not allowed. But if the form is still not closed and it’s after the due date, you can still submit the homework. confirm your submission by the date-timestamp on the Course page.y\nOlder news:[source1] [source2]'

### Perform Semantic Search - Similarity Search with Filters

Qdrant's custom vector index implementation allows for precise and scalable vector search with filtering conditions.

For example, we can search for an answer related to a specific course. Using the `must` filter ensures that all specified conditions are met for a data point. 

Qdrant also supports other filter clauses such as `should` and `must_not`, as well as other condition types such as `range`.
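
The meaning of these clauses can be sketched in plain Python over payload dictionaries. This is not Qdrant's implementation or API: the `matches` helper, the `(key, value)` condition format, and the sample payloads (including the section names) are assumptions made for illustration. The sketch takes `must` to mean all conditions hold, `should` to mean at least one holds, and `must_not` to mean none holds, and it assumes the clauses are combined with a logical AND.

```python
def matches(payload, must=(), should=(), must_not=()):
    """Check a payload against (key, value) equality conditions."""
    def holds(condition):
        key, value = condition
        return payload.get(key) == value

    if not all(holds(c) for c in must):
        return False  # every `must` condition has to hold
    if should and not any(holds(c) for c in should):
        return False  # at least one `should` condition has to hold
    if any(holds(c) for c in must_not):
        return False  # no `must_not` condition may hold
    return True


# Hypothetical payloads in the shape stored by this notebook.
payloads = [
    {"course": "mlops-zoomcamp", "section": "General"},
    {"course": "mlops-zoomcamp", "section": "Module 1"},
    {"course": "data-engineering-zoomcamp", "section": "General"},
]

mlops_only = [p for p in payloads if matches(p, must=[("course", "mlops-zoomcamp")])]
not_general = [p for p in payloads if matches(p, must_not=[("section", "General")])]
print(len(mlops_only), len(not_general))
```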

To enable efficient filtering, we need to activate the indexing of payload fields.


In [21]:
qdrant.create_payload_index(
    collection_name=collection_name,
    field_name="course",
    field_schema="keyword"
)

UpdateResult(operation_id=2, status=<UpdateStatus.COMPLETED: 'completed'>)

In [22]:
def search_by_filter(query, filter_key, filter_value, limit = 1):
    return qdrant.query_points(
        collection_name=collection_name,
        query=models.Document(
            text=query,
            model=embedding_model
        ),
        query_filter=models.Filter(
            must=[
                models.FieldCondition(
                    key=filter_key,
                    match=models.MatchValue(value=filter_value)
                )
            ]
        ),
        limit=limit,
        with_payload=True
    )

In [25]:
search_filter_results = search_by_filter(query="What if I submit homeworks late?", filter_key="course", filter_value="mlops-zoomcamp")

answer_filtered = search_filter_results.points[0].payload["text"]

answer_filtered


'Please choose the closest one to your answer. Also do not post your answer in the course slack channel.'