## Homework: Search Evaluation

In this homework, we will evaluate the results of vector
search.

> It's possible that your answers won't match exactly. If it's the case, select the closest one.


## Required libraries

We will use minsearch and Qdrant. Make sure you have the most up-to-date versions:

```bash
pip install -U minsearch qdrant_client
``` 

minsearch should be at least 0.0.4.



## Evaluation data

For this homework, we will use the same dataset we generated
in the videos.

Let's load the documents and the ground truth:

In [8]:
import requests
import pandas as pd

url_prefix = 'https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/03-evaluation/'
docs_url = url_prefix + 'search_evaluation/documents-with-ids.json'
documents = requests.get(docs_url).json()

ground_truth_url = url_prefix + 'search_evaluation/ground-truth-data.csv'
df_ground_truth = pd.read_csv(ground_truth_url)
ground_truth = df_ground_truth.to_dict(orient='records')

Here, `documents` contains the documents from the FAQ database
with unique IDs, and `ground_truth` contains generated
question-answer pairs. 
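Before evaluating, it can help to confirm that every ground-truth record points at a document that actually exists. The sketch below uses tiny hand-made stand-ins (`toy_documents`, `toy_ground_truth`, and all values in them are hypothetical); only the field names (`id`, `course`, `section`, `question`, `text`, `document`) mirror the ones used in this notebook.

```python
# Hypothetical miniature versions of `documents` and `ground_truth`.
toy_documents = [
    {"id": "doc-a", "course": "course-1", "section": "General",
     "question": "When does the course start?", "text": "It starts in January."},
    {"id": "doc-b", "course": "course-1", "section": "Setup",
     "question": "How do I install Docker?", "text": "Follow the official guide."},
]

toy_ground_truth = [
    {"question": "What is the start date?", "course": "course-1", "document": "doc-a"},
    {"question": "Where can I find Docker setup steps?", "course": "course-1", "document": "doc-b"},
    {"question": "Is there a certificate?", "course": "course-1", "document": "doc-z"},
]

# Collect the known document IDs once, then look for dangling references.
known_ids = {doc["id"] for doc in toy_documents}
missing = [q["document"] for q in toy_ground_truth if q["document"] not in known_ids]

print(missing)
```

With the real data, the same check would run over `documents` and `ground_truth`; an empty `missing` list means every question can, in principle, be answered by a retrievable document.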

Also, we will need the code for evaluating retrieval:

In [14]:
from tqdm.auto import tqdm

def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank in range(len(line)):
            if line[rank] == True:
                total_score = total_score + 1 / (rank + 1)

    return total_score / len(relevance_total)

def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['document']
        results = search_function(q)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }
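To make the two metrics concrete, here is a small worked example on made-up relevance lists. The helper `first_hit_reciprocal_rank` is a hypothetical name introduced only for this illustration; it is not part of the homework code.

```python
# One list per query, one boolean per returned result (values are made up).
toy_relevance = [
    [True, False, False],   # relevant document at rank 1
    [False, False, True],   # relevant document at rank 3
    [False, False, False],  # relevant document not retrieved
]

def first_hit_reciprocal_rank(line):
    # 1 / rank of the first relevant result, or 0.0 if there is none.
    for rank, is_relevant in enumerate(line, start=1):
        if is_relevant:
            return 1 / rank
    return 0.0

# Hit rate: share of queries with the relevant document anywhere in the results.
toy_hit_rate = sum(any(line) for line in toy_relevance) / len(toy_relevance)

# MRR: mean reciprocal rank over all queries.
toy_mrr = sum(first_hit_reciprocal_rank(line) for line in toy_relevance) / len(toy_relevance)

print(toy_hit_rate)  # 2 of 3 queries hit
print(toy_mrr)       # (1 + 1/3 + 0) / 3
```

Note that `mrr` above adds a reciprocal rank for every `True` in a line rather than stopping at the first one. The two versions agree as long as each result list contains the relevant document at most once, which is the case when document IDs are unique.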

## Q1. Minsearch text

Now let's evaluate our usual minsearch approach, but tweak
the parameters. Let's use the following boosting 
params:

In [3]:
boost = {'question': 1.5, 'section': 0.1}

What's the hit rate for this approach?

* 0.64
* 0.74
* 0.84
* 0.94

In [4]:
import minsearch

index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course", "id"]
)

index.fit(documents)

<minsearch.minsearch.Index at 0x72ec4871fa10>

In [5]:
def minsearch_search(query, course):
    boost = {'question': 1.5, 'section': 0.1}

    results = index.search(
        query=query,
        filter_dict={'course': course},
        boost_dict=boost,
        num_results=5
    )

    return results

In [6]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    results = minsearch_search(query=q['question'], course=q['course'])
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)


In [7]:
hit_rate(relevance_total)

0.848714069591528

## Q1 ANSWER
c) 0.84
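The `evaluate` helper defined earlier calls `search_function(q)` with the whole ground-truth record, while `minsearch_search` takes `query` and `course`. A small adapter can bridge the two so the loop does not need to be rewritten for every question. The sketch below is self-contained and uses a stand-in search (`toy_search`, `toy_corpus`, `toy_ground_truth`, and `as_record_search` are all hypothetical names with made-up data), not the real minsearch index.

```python
# Made-up corpus; only the field names mirror the homework data.
toy_corpus = [
    {"id": "a1", "course": "c1", "question": "How do I install Docker?"},
    {"id": "a2", "course": "c1", "question": "When does the course start?"},
    {"id": "b1", "course": "c2", "question": "How do I install Docker?"},
]

def toy_search(query, course, num_results=5):
    # Stand-in for a real search: filter by course, rank by shared words.
    words = set(query.lower().split())
    candidates = [d for d in toy_corpus if d["course"] == course]
    ranked = sorted(
        candidates,
        key=lambda d: len(words & set(d["question"].lower().split())),
        reverse=True,
    )
    return ranked[:num_results]

def as_record_search(search_function):
    # Adapt a (query, course) search to the single-record signature.
    return lambda q: search_function(query=q["question"], course=q["course"])

toy_ground_truth = [
    {"question": "when will the course start", "course": "c1", "document": "a2"},
    {"question": "install docker how", "course": "c2", "document": "b1"},
]

search_for_record = as_record_search(toy_search)
toy_relevance = [
    [d["id"] == q["document"] for d in search_for_record(q)]
    for q in toy_ground_truth
]

print(toy_relevance)
```

In the notebook, the same idea would read `evaluate(ground_truth, lambda q: minsearch_search(q['question'], q['course']))`, which returns both `hit_rate` and `mrr` in one call.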

## Embeddings 

The latest version of minsearch also supports vector search. 
We will use it:


In [8]:
from minsearch import VectorSearch

We will also use TF-IDF and Singular Value Decomposition to 
create embeddings from texts. You can refer to our
["Create Your Own Search Engine" workshop](https://github.com/alexeygrigorev/build-your-own-search-engine)
if you want to know more about it.

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

Let's create embeddings for the "question" field:

In [10]:
texts = []

for doc in documents:
    t = doc['question']
    texts.append(t)

pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)
X = pipeline.fit_transform(texts)

In [11]:
texts


['Course - When will the course start?',
 'Course - What are the prerequisites for this course?',
 'Course - Can I still join the course after the start date?',
 'Course - I have registered for the Data Engineering Bootcamp. When can I expect to receive the confirmation email?',
 'Course - What can I do before the course starts?',
 'Course - how many Zoomcamps in a year?',
 'Course - Is the current cohort going to be different from the previous cohort?',
 'Course - Can I follow the course after it finishes?',
 'Course - Can I get support if I take the course in the self-paced mode?',
 'Course - Which playlist on YouTube should I refer to?',
 'Course - \u200b\u200bHow many hours per week am I expected to spend on this  course?',
 'Certificate - Can I follow the course in a self-paced mode and get a certificate?',
 'Office Hours - What is the video/zoom link to the stream for the “Office Hour” or workshop sessions?',
 'Office Hours - I can’t attend the “Office hours” / workshop, will it 

## Q2. Vector search for question

Now let's index these embeddings with minsearch:

In [12]:
vindex = VectorSearch(keyword_fields={'course'})
vindex.fit(X, documents)

<minsearch.vector.VectorSearch at 0x72ec165e11f0>

Evaluate this search method. What's the MRR for it?

- 0.25
- 0.35
- 0.45
- 0.55

In [13]:
def vector_search(query, course):
    results = vindex.search(
        query_vector=query,
        filter_dict={'course': course},
        num_results=5
    )

    return results
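Conceptually, a filtered vector search like the one above keeps only the documents that match the filter, scores them against the query vector, and returns the top results. The following pure-Python sketch illustrates that idea under the assumption that cosine similarity is the scoring function. It is not minsearch's actual implementation, and `cosine`, `toy_vector_search`, and the toy data are hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def toy_vector_search(query_vector, vectors, payloads, course, num_results=5):
    # Keep documents from the requested course and score each against the query.
    scored = [
        (cosine(query_vector, vec), doc)
        for vec, doc in zip(vectors, payloads)
        if doc["course"] == course
    ]
    # Highest similarity first, then cut to the requested number of results.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:num_results]]

# Made-up 2-dimensional "embeddings" and matching payloads.
toy_vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
toy_payloads = [
    {"id": "a1", "course": "c1"},
    {"id": "a2", "course": "c1"},
    {"id": "b1", "course": "c2"},
]

top = toy_vector_search([0.0, 1.0], toy_vectors, toy_payloads, course="c1")
print([doc["id"] for doc in top])
```

Document `b1` has the vector closest to the query but belongs to another course, so the filter removes it before ranking.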

In [14]:
q2_relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    query_vector = pipeline.transform([q['question']])
    results = vector_search(query=query_vector, course=q['course'])
    relevance = [d['id'] == doc_id for d in results]
    q2_relevance_total.append(relevance)


In [15]:
mrr(q2_relevance_total)

0.3571284489590088

## Q2 ANSWER
b) 0.35

## Q3. Vector search for question and answer

We only used question in Q2. We can use both question and answer:

In [16]:
texts = []

for doc in documents:
    t = doc['question'] + ' ' + doc['text']
    texts.append(t)

Using the same pipeline (`min_df=3` for the TF-IDF vectorizer and `n_components=128` for SVD), evaluate the performance of this
approach.

What's the hit rate?

- 0.62
- 0.72
- 0.82
- 0.92

In [17]:

pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)
X = pipeline.fit_transform(texts)

In [18]:
vindex = VectorSearch(keyword_fields={'course'})
vindex.fit(X, documents)

<minsearch.vector.VectorSearch at 0x72ec177b82c0>

In [19]:
q3_relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    query_vector = pipeline.transform([q['question']])
    results = vector_search(query=query_vector, course=q['course'])
    relevance = [d['id'] == doc_id for d in results]
    q3_relevance_total.append(relevance)


In [20]:
hit_rate(q3_relevance_total)

0.8210503566025502

## Q3 ANSWER
c) 0.82

## Q4. Qdrant

Now let's evaluate the following settings in Qdrant:

- `text = doc['question'] + ' ' + doc['text']`
- `model_handle = "jinaai/jina-embeddings-v2-small-en"`
- `limit = 5`

What's the MRR?

- 0.65
- 0.75
- 0.85
- 0.95

In [2]:
from qdrant_client import QdrantClient, models
client = QdrantClient("http://localhost:6333")  # connect to the local Qdrant instance

In [22]:
!pip install fastembed --upgrade


In [3]:
from fastembed import TextEmbedding

In [4]:
import json

EMBEDDING_DIMENSIONALITY = 512

for model in TextEmbedding.list_supported_models():
    if model["dim"] == EMBEDDING_DIMENSIONALITY:
        print(json.dumps(model, indent=3))

{
   "model": "BAAI/bge-small-zh-v1.5",
   "sources": {
      "hf": "Qdrant/bge-small-zh-v1.5",
      "url": "https://storage.googleapis.com/qdrant-fastembed/fast-bge-small-zh-v1.5.tar.gz",
      "_deprecated_tar_struct": true
   },
   "model_file": "model_optimized.onnx",
   "description": "Text embeddings, Unimodal (text), Chinese, 512 input tokens truncation, Prefixes for queries/documents: not so necessary, 2023 year.",
   "license": "mit",
   "size_in_GB": 0.09,
   "additional_files": [],
   "dim": 512,
   "tasks": {}
}
{
   "model": "Qdrant/clip-ViT-B-32-text",
   "sources": {
      "hf": "Qdrant/clip-ViT-B-32-text",
      "url": null,
      "_deprecated_tar_struct": false
   },
   "model_file": "model.onnx",
   "description": "Text embeddings, Multimodal (text&image), English, 77 input tokens truncation, Prefixes for queries/documents: not necessary, 2021 year",
   "license": "mit",
   "size_in_GB": 0.25,
   "additional_files": [],
   "dim": 512,
   "tasks": {}
}
{
   "model": "

In [5]:
EMBEDDING_DIMENSIONALITY = 512
#model_handle = TextEmbedding(model_name="jinaai/jina-embeddings-v2-small-en")
model_handle = "jinaai/jina-embeddings-v2-small-en"

In [6]:
# Define the collection name
collection_name = "homework3"
if client.collection_exists(collection_name):
    client.delete_collection(collection_name)                               
# Create the collection with specified vector parameters
client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=EMBEDDING_DIMENSIONALITY,  # Dimensionality of the vectors
        distance=models.Distance.COSINE  # Distance metric for similarity search
    )
)

True

In [9]:
points = []

# Use each document's position in `documents` as its Qdrant point ID
for point_id, doc in enumerate(documents):
    text = doc['question'] + ' ' + doc['text']
    point = models.PointStruct(
        id=point_id,
        # Embed the text locally with "jinaai/jina-embeddings-v2-small-en" from FastEmbed
        vector=models.Document(text=text, model="jinaai/jina-embeddings-v2-small-en"),
        # Save all needed metadata fields
        payload={
            "text": doc['text'],
            "section": doc['section'],
            "course": doc['course'],
            "doc_id": doc['id'],
        }
    )
    points.append(point)

In [10]:
client.upsert(
    collection_name=collection_name,
    points=points
)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

In [11]:
def search(query, limit=5):

    results = client.query_points(
        collection_name=collection_name,
        query=models.Document( 
            text=query,
            model="jinaai/jina-embeddings-v2-small-en"
        ),
        limit=limit, # top closest matches
        with_payload=True #to get metadata in the results
    )

    return results

In [None]:
q4_relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    results = search(query=q['question'])
    # `search` returns a response object; the scored points are in `results.points`
    relevance = [d.payload["doc_id"] == doc_id for d in results.points]
    q4_relevance_total.append(relevance)

In [None]:
results.points[0].payload["doc_id"]

'636f55d5'