### Homework: Search Evaluation

In this homework, we will evaluate the results of vector search.

It's possible that your answers will not match exactly. If that is the case, select the closest one.

### Required Libraries

We will use minsearch and Qdrant. Make sure you have the most up-to-date versions:

`pip install -U minsearch qdrant_client rouge`

The minsearch version should be at least 0.0.4.

In [1]:
%pip install -U minsearch qdrant_client rouge

Note: you may need to restart the kernel to use updated packages.


In [5]:
import requests
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
import minsearch

### Evaluation Data

For this homework, we will use the same dataset we generated in the videos.

Let's load the documents and the ground truth data:

In [3]:
url_prefix = 'https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/03-evaluation/'
docs_url = url_prefix + 'search_evaluation/documents-with-ids.json'
documents = requests.get(docs_url).json()

ground_truth_url = url_prefix + 'search_evaluation/ground-truth-data.csv'
df_ground_truth = pd.read_csv(ground_truth_url)
ground_truth = df_ground_truth.to_dict(orient='records')

Here, `documents` contains the documents from the FAQ database with unique IDs, and `ground_truth` contains generated question-answer pairs.

Also, we will need the code for evaluating retrieval:

In [4]:
def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank in range(len(line)):
            if line[rank]:
                total_score = total_score + 1 / (rank + 1)
                break  # count only the first relevant result

    return total_score / len(relevance_total)

def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['document']
        results = search_function(q)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }
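To make the two metrics concrete, here is a small worked example on hand-made relevance lists. The toy data is an assumption for illustration only and is not taken from the ground truth set. The two functions are compact versions that behave like the ones above whenever each query has at most one relevant document.

```python
def hit_rate(relevance_total):
    # Fraction of queries with at least one relevant result
    hits = sum(1 for line in relevance_total if True in line)
    return hits / len(relevance_total)

def mrr(relevance_total):
    # Average of 1 / rank of the relevant result (0 when it is absent)
    total = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line):
            if is_relevant:
                total += 1 / (rank + 1)
                break
    return total / len(relevance_total)

# Hypothetical relevance lists for four queries, three results each
toy_relevance = [
    [True, False, False],   # relevant at rank 1 -> contributes 1
    [False, False, True],   # relevant at rank 3 -> contributes 1/3
    [False, False, False],  # miss              -> contributes 0
    [False, True, False],   # relevant at rank 2 -> contributes 1/2
]

toy_hit_rate = hit_rate(toy_relevance)  # 3 of 4 queries have a hit
toy_mrr = mrr(toy_relevance)            # (1 + 1/3 + 0 + 1/2) / 4
print(toy_hit_rate, toy_mrr)
```

Hit rate only asks whether the relevant document appears at all, while MRR also rewards placing it higher in the list.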

### Q1. Minsearch Text

Now let's evaluate our usual minsearch approach, indexing documents with:


In [6]:
index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course", "id"]
)

index.fit(documents)

<minsearch.minsearch.Index at 0x745d9f3177d0>

but tweak the parameters for search. Let's use the following boosting parameters:

`boost = {'question': 1.5, 'section': 0.1}`

In [7]:
def minsearch_search(query, course):
    boost = {'question': 1.5, 'section': 0.1}

    results = index.search(
        query=query,
        filter_dict={'course': course},
        boost_dict=boost,
        num_results=5
    )

    return results

In [8]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    results = minsearch_search(query=q['question'], course=q['course'])
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)


In [9]:
hit_rate(relevance_total), mrr(relevance_total)

(0.848714069591528, 0.7288235717887772)

What's the hit rate for this approach?

- 0.64
- 0.74
- 0.84
- 0.94

### A1. With `question` boosted by 1.5 and `section` by 0.1, the minsearch hit rate is about 0.85; the closest option is 0.84.

### Embeddings

The latest version of minsearch also supports vector search. We will use it:

In [10]:
print(minsearch.__version__)

0.0.4


In [12]:
from minsearch import VectorSearch
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

We will also use TF-IDF and Singular Value Decomposition to 
create embeddings from texts. You can refer to our
["Create Your Own Search Engine" workshop](https://github.com/alexeygrigorev/build-your-own-search-engine)
if you want to know more about it.

Let's create embeddings for the "question" field:

In [13]:
texts = []

for doc in documents:
    t = doc['question']
    texts.append(t)

pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)
X = pipeline.fit_transform(texts)
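To see what such a pipeline produces, the sketch below runs the same two steps on a tiny made-up corpus. The texts, the `toy_` names, and the choice of 2 components are assumptions for illustration; the real embeddings above use 128 components and `min_df=3`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus standing in for the FAQ questions
toy_texts = [
    "how do I join the course",
    "can I still join after the start date",
    "how do I install docker",
    "docker install fails on windows",
]

toy_pipeline = make_pipeline(
    TfidfVectorizer(min_df=1),                      # keep every term in this tiny corpus
    TruncatedSVD(n_components=2, random_state=1),   # compress to 2 dense dimensions
)

# One dense row per document
toy_X = toy_pipeline.fit_transform(toy_texts)

# A new query is embedded with transform, reusing the fitted vocabulary and SVD
toy_query = toy_pipeline.transform(["join the course late"])

print(toy_X.shape, toy_query.shape)
```

The key point is that queries must go through the same fitted pipeline as the documents, so both end up in the same vector space.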

### Q2. Vector Search for Question

Now let's index these embeddings with minsearch:

In [14]:
vindex = VectorSearch(keyword_fields={'course'})
vindex.fit(X, documents)

<minsearch.vector.VectorSearch at 0x745d9d71fd10>
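Conceptually, a vector index with keyword filtering scores every stored vector against the query and returns the best matches that pass the filter. The sketch below illustrates that idea in plain Python with made-up 2-dimensional vectors. `toy_vector_search` and the toy records are hypothetical, and this is not minsearch's actual implementation.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def toy_vector_search(query_vector, vectors, payload, filter_dict, num_results=5):
    scored = []
    for vec, doc in zip(vectors, payload):
        # Keep only documents whose keyword fields match the filter exactly
        if all(doc.get(key) == value for key, value in filter_dict.items()):
            scored.append((cosine(query_vector, vec), doc))
    # Highest similarity first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:num_results]]

# Hypothetical vectors and documents
toy_vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [1.0, 0.1]]
toy_docs = [
    {'id': 'a', 'course': 'ml'},
    {'id': 'b', 'course': 'ml'},
    {'id': 'c', 'course': 'ml'},
    {'id': 'd', 'course': 'de'},
]

toy_results = toy_vector_search(
    [1.0, 0.0], toy_vectors, toy_docs, {'course': 'ml'}, num_results=2
)
print([d['id'] for d in toy_results])
```

Document `d` is very close to the query but is excluded because its `course` does not match the filter.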

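In [None]:
# Embed the question with the fitted pipeline and search the vector index.
# Assumes VectorSearch.search accepts a 1-D query vector, as in the minsearch README.
def vector_search(q):
    query_vector = pipeline.transform([q['question']])[0]  # 128-dimensional
    filter_dict = {'course': q['course']}
    return vindex.search(query_vector, filter_dict=filter_dict, num_results=5)

In [None]:
evaluate(ground_truth, vector_search)['mrr']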