### Homework: Search Evaluation

In this homework, we will evaluate the results of vector search.

It's possible that your answers will not match exactly. If that is the case, select the closest one.

### Required Libraries

We will use minsearch and Qdrant. Make sure you have the most up-to-date versions:

`pip install -U minsearch qdrant_client rouge`

minsearch should be at least 0.0.4

In [2]:
#%pip install -U minsearch qdrant_client rouge

In [1]:
import requests
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
import minsearch
from minsearch import VectorSearch
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from rouge import Rouge
from fastembed import TextEmbedding
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct, CollectionStatus, CollectionInfo
from qdrant_client import models

In [2]:
print(minsearch.__version__)

0.0.4


### Evaluation Data

For this homework, we will use the same dataset we generated in the videos.

Let's load the documents and the ground truth data:

In [3]:
url_prefix = 'https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/03-evaluation/'
docs_url = url_prefix + 'search_evaluation/documents-with-ids.json'
documents = requests.get(docs_url).json()

ground_truth_url = url_prefix + 'search_evaluation/ground-truth-data.csv'
df_ground_truth = pd.read_csv(ground_truth_url)
ground_truth = df_ground_truth.to_dict(orient='records')

Here, `documents` contains the documents from the FAQ database with unique IDs, and `ground_truth` contains generated question-answer pairs.

Also, we will need the code for evaluating retrieval:

In [4]:
def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank in range(len(line)):
            if line[rank] == True:
                total_score = total_score + 1 / (rank + 1)

    return total_score / len(relevance_total)

def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['document']
        results = search_function(q)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }
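
To make the two metrics concrete, here is a small sketch on hand-made relevance lists. The `toy_relevance` data is hypothetical and only for illustration; the functions are compact restatements of `hit_rate` and `mrr` above so that the snippet runs on its own.

```python
def hit_rate(relevance_total):
    # Fraction of queries with at least one relevant result in the returned list
    return sum(any(line) for line in relevance_total) / len(relevance_total)


def mrr(relevance_total):
    # Add 1 / rank for every relevant hit, then average over all queries
    total = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line):
            if is_relevant:
                total += 1 / (rank + 1)
    return total / len(relevance_total)


# Hypothetical results for three queries, three results each
toy_relevance = [
    [False, True, False],   # relevant document at rank 2, contributes 1/2
    [True, False, False],   # relevant document at rank 1, contributes 1
    [False, False, False],  # relevant document not retrieved, contributes 0
]

toy_hit_rate = hit_rate(toy_relevance)  # 2 of 3 queries have a hit
toy_mrr = mrr(toy_relevance)            # (1/2 + 1 + 0) / 3

print(toy_hit_rate)
print(toy_mrr)
```

Hit rate only checks whether the right document appears at all, while MRR also rewards placing it closer to the top.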

### Q1. Minsearch Text

Now let's evaluate our usual minsearch approach, indexing documents with:


In [6]:
index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course", "id"]
)

index.fit(documents)

<minsearch.minsearch.Index at 0x738a52bc1820>

but tweak the parameters for search. Let's use the following boosting parameters:

`boost = {'question': 1.5, 'section': 0.1}`

In [7]:
def minsearch_search(query, course):
    boost = {'question': 1.5, 'section': 0.1}

    results = index.search(
        query=query,
        filter_dict={'course': course},
        boost_dict=boost,
        num_results=5
    )

    return results

In [8]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    results = minsearch_search(query=q['question'], course=q['course'])
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)

  0%|          | 0/4627 [00:00<?, ?it/s]

In [9]:
hit_rate(relevance_total), mrr(relevance_total)

(0.848714069591528, 0.7288235717887772)

What's the hitrate for this approach?

- 0.64
- 0.74
- 0.84
- 0.94

### A1. The Hit Rate using minsearch and boosting the question by 1.5 and the section by 0.1 is 0.84.

### Embeddings

The latest version of minsearch also supports vector search. We will use it in the next questions.

We will also use TF-IDF and Singular Value Decomposition to 
create embeddings from texts. You can refer to our
["Create Your Own Search Engine" workshop](https://github.com/alexeygrigorev/build-your-own-search-engine)
if you want to know more about it.
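
As a rough illustration of what such a pipeline produces, here is a minimal sketch on a handful of made-up texts. The `toy_texts` list and the small parameter values (`min_df=1`, `n_components=3`) are assumptions chosen so that the example works on tiny data; they are not the homework settings.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus, only for illustration
toy_texts = [
    "how do I install docker",
    "docker install fails on windows",
    "when does the course start",
    "can I join the course after the start date",
    "how do I submit homework",
    "homework deadline and late submission",
]

# TF-IDF turns each text into a sparse bag-of-words vector,
# SVD then compresses it into a small dense vector
toy_pipeline = make_pipeline(
    TfidfVectorizer(min_df=1),
    TruncatedSVD(n_components=3, random_state=1),
)

X_toy = toy_pipeline.fit_transform(toy_texts)

# A query must go through the same fitted pipeline to land in the same space
query_vec = toy_pipeline.transform(["docker install problem"])[0]

print(X_toy.shape)      # one row per text, one column per SVD component
print(query_vec.shape)  # a single dense vector
```

The number of SVD components has to be smaller than the vocabulary size, which is why the toy example uses 3 rather than 128.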

Let's create embeddings for the "question" field:

In [10]:
texts = []

for doc in documents:
    t = doc['question']
    texts.append(t)

pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)
X = pipeline.fit_transform(texts)

### Q2. Vector Search for Question

Now let's index these embeddings with minsearch:

In [11]:
vindex = VectorSearch(keyword_fields={'course'})
vindex.fit(X, documents)

<minsearch.vector.VectorSearch at 0x738a51d37b00>

In [16]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    query_vec = pipeline.transform([q['question']])[0]
    results = vindex.search(query_vec, filter_dict={'course': q['course']}, num_results=5)
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)

  0%|          | 0/4627 [00:00<?, ?it/s]

Evaluate this search method. What is the MRR for it?

- 0.25
- 0.35
- 0.45
- 0.55

In [17]:
mrr(relevance_total)


0.3572833369353793

### A2. The MRR is 0.35

### Q3. Vector Search for Question and Answer

We only used the question in Q2. We can use both the question and the answer:

In [18]:
texts = []

for doc in documents:
    t = doc['question'] + ' ' + doc['text']
    texts.append(t) 

pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)
X = pipeline.fit_transform(texts)

Using the same pipeline (`min_df=3` for the TF-IDF vectorizer and `n_components=128` for SVD), evaluate the performance of this approach.

In [19]:
vindex = VectorSearch(keyword_fields={'course'})
vindex.fit(X, documents)

<minsearch.vector.VectorSearch at 0x738a51367590>

In [24]:
print(df_ground_truth.columns)


Index(['question', 'course', 'document'], dtype='object')


In [25]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    query_text = q['question']
    query_vec = pipeline.transform([query_text])[0]
    results = vindex.search(query_vec, filter_dict={'course': q['course']}, num_results=5)
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)

  0%|          | 0/4627 [00:00<?, ?it/s]

What is the hitrate?

- 0.62
- 0.72
- 0.82
- 0.92

In [26]:
hit_rate(relevance_total), mrr(relevance_total)

(0.8210503566025502, 0.6717347453353508)

### A3. The Hitrate is 0.82

### Q4. Qdrant

Now let's evaluate the following settings in Qdrant:

- `text = doc['question'] + ' ' + doc['text']`
- `model_handle = "jinaai/jina-embeddings-v2-small-en"`
- `limit = 5`

In [None]:
client = QdrantClient(":memory:") 

In [6]:
model_handle = "jinaai/jina-embeddings-v2-small-en"
EMBEDDING_DIMENSIONALITY = 512
embedder = TextEmbedding(model_name=model_handle)

# Prepare the texts for embedding
texts = [doc['question'] + ' ' + doc['text'] for doc in documents]

# Use batching to avoid memory issues
batch_size = 8
embeddings = []
for i in range(0, len(texts), batch_size):
    batch = texts[i:i+batch_size]
    batch_embeddings = list(embedder.embed(batch))
    embeddings.extend(batch_embeddings)

In [None]:
collection_name = "search-evaluation"
    
client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=EMBEDDING_DIMENSIONALITY,  # Dimensionality of the vectors
        distance=models.Distance.COSINE  # Distance metric for similarity search
    )
)

In [None]:
points = []
for i, (doc, embedding) in enumerate(zip(documents, embeddings)):
    point = PointStruct(
        id=i,  # PointStruct requires a unique ID, using integer index here
        vector=embedding.tolist(),  # Convert numpy array to list
        payload={
            "question": doc['question'],
            "text": doc['text'],
            "course": doc['course'],
            "section": doc['section'],
            "id": doc['id']  # Keep original document ID in payload
        }
    )
    points.append(point)

# Upload the points; without this step the collection stays empty
client.upsert(collection_name=collection_name, points=points)

In [None]:
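relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    # The ground truth only has 'question', 'course' and 'document',
    # so the query is the question embedded with the same model as the documents
    query_embedding = list(embedder.embed([q['question']]))[0]

    response = client.query_points(
        collection_name=collection_name,
        query=query_embedding.tolist(),
        limit=5,
        with_payload=True,
        # Restrict the search to documents from the same course
        query_filter=models.Filter(
            must=[
                models.FieldCondition(
                    key="course",
                    match=models.MatchValue(value=q['course'])
                )
            ]
        )
    )

    relevance = [point.payload['id'] == doc_id for point in response.points]
    relevance_total.append(relevance)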

What's the MRR?

- 0.65
- 0.75
- 0.85
- 0.95

In [None]:
mrr(relevance_total)

### A4. The MRR is 

### Q5. Cosine Similarity

In the second part of the module, we looked at evaluating the entire RAG approach.  In particular, we looked at comparing the answer generated by our system with the actual answer from the FAQ.

One of the ways of doing it is by using the cosine similarity. Let's see how to calculate it.

Cosine similarity is the dot product of two normalized vectors. Geometrically, it's the cosine of the angle between the vectors. Look up "cosine similarity geometry" if you want to learn more about it.

For us, it means that we need two things:
- First, we normalize each of the vectors
- Then, we compute the dot product

So we get this:

```python
def cosine(u, v):
    u = normalize(u)
    v = normalize(v)
    return u.dot(v)
```

For normalization, we first compute the vector norm (its length), and then divide the vector by it:

In [None]:
def normalize(u):
    norm = np.sqrt(u.dot(u))
    return u / norm

Or we can simplify it:

In [None]:
def cosine(u, v):
    u_norm = np.sqrt(u.dot(u))
    v_norm = np.sqrt(v.dot(v))
    return u.dot(v) / (u_norm * v_norm)
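
Written as a formula, the simplified function computes the following, shown here with a small worked example (the vectors $u = (1, 0)$ and $v = (1, 1)$ are made up for illustration):

$$
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert},
\qquad
\lVert u \rVert = \sqrt{u \cdot u}
$$

$$
u \cdot v = 1, \quad \lVert u \rVert = 1, \quad \lVert v \rVert = \sqrt{2}
\quad \Rightarrow \quad
\cos(u, v) = \frac{1}{\sqrt{2}} \approx 0.707
$$

This matches the geometric picture: the two vectors are 45 degrees apart, and $\cos 45^\circ \approx 0.707$.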

Now let's use this function to compute the A->Q->A cosine similarity.

We will use the results from [our gpt-4o-mini evaluations](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/03-evaluation/rag_evaluation/data/results-gpt4o-mini.csv):

In [None]:
results_url = url_prefix + 'rag_evaluation/data/results-gpt4o-mini.csv'
df_results = pd.read_csv(results_url)

When creating the embeddings, we will use a simple approach, the same one we used in the [Embeddings](#embeddings) section:

In [None]:
pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)

Let's fit the vectorizer on all the text data we have:

In [None]:
pipeline.fit(df_results.answer_llm + ' ' + df_results.answer_orig + ' ' + df_results.question)

Now use the `transform` method of the pipeline to create the embeddings and calculate the cosine similarity between each pair.

This is how you do it:

- For each answer pair, compute
   - `v_llm` for the answer from the LLM
   - `v_orig` for the original answer
   - then compute the cosine between them
- At the end, take the average
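
The steps above can be sketched as follows. This is only an illustration under stated assumptions: `V_llm` and `V_orig` are hypothetical two-dimensional stand-ins for what `pipeline.transform(...)` would return for the LLM answers and the original answers, so the resulting number is unrelated to the homework answer.

```python
import numpy as np


def cosine(u, v):
    # Dot product divided by the product of the vector lengths
    u_norm = np.sqrt(u.dot(u))
    v_norm = np.sqrt(v.dot(v))
    return u.dot(v) / (u_norm * v_norm)


# Hypothetical embeddings: row i of each matrix belongs to the same answer pair
V_llm = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
V_orig = np.array([[1.0, 0.0], [1.0, 0.0], [3.0, 0.0]])

# Cosine for each (v_llm, v_orig) pair, then the average over all pairs
cosines = [cosine(v_llm, v_orig) for v_llm, v_orig in zip(V_llm, V_orig)]
average_cosine = float(np.mean(cosines))

print(cosines)
print(average_cosine)
```

Note that a text which produces an all-zero vector has a norm of zero, so the division yields `nan`; with real data you may need to handle such rows before averaging.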

### Q6. Rouge

An alternative way to see how two texts are similar is ROUGE.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than just cosine similarity alone.

We don't need to implement it ourselves; there's a Python package for it: `pip install rouge`

(The latest version at the moment of writing is 1.0.1)

Let's compute the ROUGE score between the answers at index 10 of our dataframe (`doc_id=5170565b`):

In [None]:
rouge_scorer = Rouge()

r = df_results.iloc[10]
scores = rouge_scorer.get_scores(r.answer_llm, r.answer_orig)[0]
scores

There are three scores: `rouge-1`, `rouge-2`, and `rouge-l`, and precision, recall and F1 scores for each.

- `rouge-1` - the overlap of unigrams
- `rouge-2` - bigrams
- `rouge-l` - the longest common subsequence
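
To see what "overlap of unigrams" means in practice, here is a simplified sketch of ROUGE-1 computed by hand. It is not the implementation used by the `rouge` package: the `rouge_1` helper is hypothetical, tokenizes by whitespace, and uses clipped word counts, so its numbers may differ from what `Rouge().get_scores` returns.

```python
from collections import Counter


def rouge_1(hypothesis, reference):
    # Count unigrams (single words) in both texts
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Overlap: for each word, the smaller of the two counts
    overlap = sum((hyp_counts & ref_counts).values())

    # Precision is relative to the hypothesis, recall to the reference
    precision = overlap / max(sum(hyp_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)

    return {"p": precision, "r": recall, "f": f1}


# Five of the six words overlap ("sat" vs "is" is the only difference)
toy_scores = rouge_1("the cat sat on the mat", "the cat is on the mat")
print(toy_scores)
```

`rouge-2` follows the same idea with pairs of consecutive words instead of single words.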

For the document at index 10, the Rouge-1 F1 score is 0.45.

Let's compute it for the pairs in the entire dataframe.  What's the average Rouge-1 F1?

- 0.25
- 0.35
- 0.45
- 0.55

### A6. The average Rouge-1 F1 is