In [1]:
## Homework: Search Evaluation


In this homework, we will evaluate the results of vector
search.

> It's possible that your answers won't match exactly. If that's the case, select the closest one.

## Required libraries

We will use minsearch and Qdrant. Make sure you have the most up-to-date versions:

```bash
pip install -U minsearch qdrant_client
```

minsearch should be at least 0.0.4.

In [2]:
# !pip install -U minsearch qdrant_client

## Evaluation data

For this homework, we will use the same dataset we generated
in the videos.

Let's load it:

In [3]:
import requests
import numpy as np
import pandas as pd

url_prefix = 'https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/03-evaluation/'
docs_url = url_prefix + 'search_evaluation/documents-with-ids.json'
documents = requests.get(docs_url).json()

ground_truth_url = url_prefix + 'search_evaluation/ground-truth-data.csv'
df_ground_truth = pd.read_csv(ground_truth_url)
ground_truth = df_ground_truth.to_dict(orient='records')

Here, `documents` contains the documents from the FAQ database
with unique IDs, and `ground_truth` contains generated
question-answer pairs.

Also, we will need the code for evaluating retrieval:

In [4]:
from tqdm.auto import tqdm

def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank, is_relevant in enumerate(line):
            if is_relevant:
                # Reciprocal rank counts only the first relevant result
                total_score += 1 / (rank + 1)
                break

    return total_score / len(relevance_total)

def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        # print("q:", q)
        doc_id = q['document']
        # print("doc_id:", doc_id)
        results = search_function(q)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }
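
To make the two metrics concrete, here is a small worked example on hand-made relevance lists. The names `hit_rate_sketch`, `mrr_sketch` and `example` are hypothetical and exist only for this illustration; the functions are compact restatements of the metrics above, assuming at most one relevant document per query.

```python
def hit_rate_sketch(relevance_total):
    # Fraction of queries with at least one relevant result
    hits = sum(1 for line in relevance_total if True in line)
    return hits / len(relevance_total)

def mrr_sketch(relevance_total):
    # Mean of 1 / rank of the first relevant result (0 when there is none)
    total = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line):
            if is_relevant:
                total += 1 / (rank + 1)
                break
    return total / len(relevance_total)

# One list per query, one boolean per returned result
example = [
    [True, False, False],   # relevant at rank 1 -> 1
    [False, False, True],   # relevant at rank 3 -> 1/3
    [False, False, False],  # no relevant result -> 0
    [False, True, False],   # relevant at rank 2 -> 1/2
]

example_hit_rate = hit_rate_sketch(example)  # 3 of 4 queries hit: 0.75
example_mrr = mrr_sketch(example)            # (1 + 1/3 + 0 + 1/2) / 4 = 11/24
```

Hit rate only asks whether the right document appears at all, while MRR also rewards placing it near the top.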

## Q1. Minsearch text

Now let's evaluate our usual minsearch approach, indexing documents with:

In [5]:
text_fields=["question", "section", "text"],
keyword_fields=["course", "id"]

but tweak the parameters for search. Let's use the following boosting params:

In [6]:
boost = {'question': 1.5, 'section': 0.1}

What's the hit rate for this approach?

* 0.64
* 0.74
* 0.84
* 0.94


In [7]:
documents[:3]

[{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
  'section': 'General course-related questions',
  'question': 'Course - When will the course start?',
  'course': 'data-engineering-zoomcamp',
  'id': 'c02e79ef'},
 {'text': 'GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites',
  'section': 'General course-related questions',
  'question': 'Course - What are the prerequisites for this course?',
  'course': 'data-engineering-zoomcamp',
  'id': '1f6520ca'},
 {'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware

In [8]:
import minsearch

index = minsearch.Index(
    text_fields=["question", "section", "text"],
    keyword_fields=["course", "id"]
)

index.fit(documents)

<minsearch.minsearch.Index at 0x719640005d80>

In [9]:
query = "I just discovered the course. Can I still join?"

In [10]:

def minsearch_search(query):
    # print("minsearch_search, query:", query)
    results = index.search(
        query=query['question'],
        filter_dict={'course': query['course']},
        boost_dict={'question': 1.5, 'section': 0.1},
        num_results=5
    )
    return results

In [11]:
evaluate(ground_truth, minsearch_search)

  0%|          | 0/4627 [00:00<?, ?it/s]

{'hit_rate': 0.848714069591528, 'mrr': 0.7288235717887772}

In [12]:
ground_truth[0:3]

[{'question': 'When does the course begin?',
  'course': 'data-engineering-zoomcamp',
  'document': 'c02e79ef'},
 {'question': 'How can I get the course schedule?',
  'course': 'data-engineering-zoomcamp',
  'document': 'c02e79ef'},
 {'question': 'What is the link for course registration?',
  'course': 'data-engineering-zoomcamp',
  'document': 'c02e79ef'}]

## Embeddings


The latest version of minsearch also supports vector search.
We will use it:

In [13]:
from minsearch import VectorSearch

We will also use TF-IDF and Singular Value Decomposition to
create embeddings from texts. You can refer to our
["Create Your Own Search Engine" workshop](https://github.com/alexeygrigorev/build-your-own-search-engine)
if you want to know more about it.


In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

Let's create embeddings for the "question" field:

In [15]:
texts = []

for doc in documents:
    t = doc['question']
    texts.append(t)

pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)
X = pipeline.fit_transform(texts)
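
To see what this pipeline produces, here is a toy sketch on a hypothetical four-sentence corpus (not taken from the FAQ data). It assumes scaled-down settings, `min_df=1` and `n_components=2`, because four short texts cannot support 128 components.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, for illustration only
toy_texts = [
    "When does the course start?",
    "How do I install Docker?",
    "Can I join the course late?",
    "Docker fails to start on Windows",
]

# Same two steps as above, with smaller parameters for the tiny corpus
toy_pipeline = make_pipeline(
    TfidfVectorizer(min_df=1),
    TruncatedSVD(n_components=2, random_state=1),
)

# TF-IDF yields a sparse term matrix; SVD compresses it to dense vectors
toy_X = toy_pipeline.fit_transform(toy_texts)

# A query must go through the same fitted pipeline to land in the same space
toy_query = toy_pipeline.transform(["Is it too late to join?"])

print(toy_X.shape)      # (4, 2)
print(toy_query.shape)  # (1, 2)
```

Each text becomes one row with `n_components` columns, which is the shape a vector index expects.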

## Q2. Vector search for question

Now let's index these embeddings with minsearch:

In [16]:
vindex = VectorSearch(keyword_fields={'course'})
vindex.fit(X, documents)

<minsearch.vector.VectorSearch at 0x71961add9600>

Evaluate this search method. What's MRR for it?

- 0.25
- 0.35
- 0.45
- 0.55

In [17]:
def minsearch_vec_search(query):
    # print("minsearch_vec_search, query:", query)
    question = query['question']
    # print("question:", question)
    query_vec = pipeline.transform([question])
    # print("query_vec:", query_vec)
    results = vindex.search(
        query_vec,
        filter_dict={'course': query['course']},
        num_results=5
    )
    return results
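
Conceptually, a filtered vector search keeps the documents that match the keyword filter, scores them against the query vector, and returns the best few. The sketch below illustrates that idea with NumPy. It is not minsearch's implementation: the function `toy_vector_search`, the cosine scoring, and the toy data are all assumptions made for illustration, and it assumes no vector has zero length.

```python
import numpy as np

def toy_vector_search(query_vec, vectors, payloads, course, num_results=5):
    """Return the payloads of the best-scoring documents for one course."""
    # Keep only the rows whose payload matches the keyword filter
    keep = np.array([p["course"] == course for p in payloads])
    candidates = np.flatnonzero(keep)
    if candidates.size == 0:
        return []

    # Score the remaining rows by cosine similarity with the query
    doc_vecs = vectors[candidates]
    scores = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )

    # Sort by descending score and keep the top results
    top = candidates[np.argsort(-scores)[:num_results]]
    return [payloads[i] for i in top]

# Hypothetical 2-D embeddings and payloads
toy_vectors = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [1.0, 0.1]])
toy_payloads = [
    {"id": "a", "course": "x"},
    {"id": "b", "course": "x"},
    {"id": "c", "course": "x"},
    {"id": "d", "course": "y"},
]

toy_hits = toy_vector_search(
    np.array([1.0, 0.0]), toy_vectors, toy_payloads, course="x", num_results=2
)
print([h["id"] for h in toy_hits])  # ['a', 'b']
```

Document `d` is close to the query but belongs to another course, so the filter removes it before scoring.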

In [18]:
evaluate(ground_truth, minsearch_vec_search)

  0%|          | 0/4627 [00:00<?, ?it/s]

{'hit_rate': 0.48195374972984656, 'mrr': 0.3573085512571141}

## Q3. Vector search for question and answer

We only used question in Q2. We can use both question and answer:

In [19]:
texts = []

for doc in documents:
    t = doc['question'] + ' ' + doc['text']
    texts.append(t)

Using the same pipeline (`min_df=3` for TF-IDF vectorizer and `n_components=128` for SVD), evaluate the performance of this
approach.

What's the hit rate?

- 0.62
- 0.72
- 0.82
- 0.92

In [20]:
pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)
X = pipeline.fit_transform(texts)

In [21]:
vindex = VectorSearch(keyword_fields={'course'})
vindex.fit(X, documents)

<minsearch.vector.VectorSearch at 0x71961894ebc0>

In [22]:
def minsearch_vec_search(query):
    question = query['question']
    query_vec = pipeline.transform([question])
    results = vindex.search(
        query_vec,
        filter_dict={'course': query['course']},
        num_results=5
    )
    return results

In [23]:
evaluate(ground_truth, minsearch_vec_search)

  0%|          | 0/4627 [00:00<?, ?it/s]

{'hit_rate': 0.8210503566025502, 'mrr': 0.6717347453353508}

## Q4. Qdrant

Now let's evaluate the following settings in Qdrant:

- `text = doc['question'] + ' ' + doc['text']`
- `model_handle = "jinaai/jina-embeddings-v2-small-en"`
- `limit = 5`

What's the MRR?

- 0.65
- 0.75
- 0.85
- 0.95


In [24]:
!pip install -q "qdrant-client[fastembed]>=1.14.2"

```bash
docker pull qdrant/qdrant

docker run -p 6333:6333 -p 6334:6334 -v "$(pwd)/qdrant_storage:/qdrant/storage:z" qdrant/qdrant
```

In [25]:
from qdrant_client import QdrantClient, models

In [26]:
qd_client = QdrantClient("http://localhost:6333")

In [27]:
model_handle = "jinaai/jina-embeddings-v2-small-en"

In [28]:
EMBEDDING_DIMENSIONALITY = 512

In [29]:
limit = 5

In [30]:
collection_name = "zoomcamp-faq"
qd_client.delete_collection(collection_name=collection_name)

True

In [31]:
qd_client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=EMBEDDING_DIMENSIONALITY,
        distance=models.Distance.COSINE
    )
)

True

In [32]:
qd_client.create_payload_index(
    collection_name=collection_name,
    field_name="course",
    field_schema="keyword"
)

UpdateResult(operation_id=1, status=<UpdateStatus.COMPLETED: 'completed'>)

In [33]:
points = []

for i, doc in enumerate(documents):
    text = doc['question'] + ' ' + doc['text']
    vector = models.Document(text=text, model=model_handle)
    point = models.PointStruct(
        id=i,
        vector=vector,
        payload=doc
    )
    points.append(point)

In [34]:
qd_client.upsert(
    collection_name=collection_name,
    points=points
)

UpdateResult(operation_id=2, status=<UpdateStatus.COMPLETED: 'completed'>)

In [35]:
def qdrant_vector_search(query):
    question = query['question']
    course = query['course']

    query_points = qd_client.query_points(
        collection_name=collection_name,
        query=models.Document(
            text=question,
            model=model_handle 
        ),
        query_filter=models.Filter( 
            must=[
                models.FieldCondition(
                    key="course",
                    match=models.MatchValue(value=course)
                )
            ]
        ),
        limit=5,
        with_payload=True
    )
    
    results = []
    
    for point in query_points.points:
        results.append(point.payload)
    
    return results


In [36]:
question = 'I just discovered the course. Can I still join it?'
qdrant_vector_search({
    "question": question,
    "course": "data-engineering-zoomcamp"
})

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp',
  'id': '7842b56a'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
  'section': 'General course-related questions',
  'question': 'Course - Can I follow the course after it finishes?',
  'course': 'data-engineering-zoomcamp',
  'id': 'a482086d'},
 {'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and

In [37]:
evaluate(ground_truth, qdrant_vector_search)

  0%|          | 0/4627 [00:00<?, ?it/s]

{'hit_rate': 0.9299762264966501, 'mrr': 0.8517722066133576}

## Q5. Cosine similarity

In the second part of the module, we looked at evaluating
the entire RAG approach. In particular, we looked at
comparing the answer generated by our system with the actual
answer from the FAQ.

One of the ways of doing it is using the cosine similarity. 
Let's see how to calculate it.

Cosine similarity is the dot product of two normalized vectors.
In a geometrical sense, it's the cosine of the angle between
the vectors. Look up "cosine similarity geometry" if you want to
learn more about it.

For us, it means that we need two things:

- First, we normalize each of the vectors
- Then, compute the dot product

So, we get this:

In [38]:
def cosine(u, v):
    u = normalize(u)
    v = normalize(v)
    return u.dot(v)

For normalization, we first compute the vector norm (its length), and then divide the vector by it:

In [39]:
def normalize(u):
    norm = np.sqrt(u.dot(u))
    return u / norm

Or we can simplify it:

In [40]:
def cosine(u, v):
    u_norm = np.sqrt(u.dot(u))
    v_norm = np.sqrt(v.dot(v))
    return u.dot(v) / (u_norm * v_norm)
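
As a quick sanity check, the simplified function can be tried on vectors whose angle is known. The vectors below are hypothetical and chosen only because the expected cosines are easy to verify by hand.

```python
import numpy as np

def cosine(u, v):
    u_norm = np.sqrt(u.dot(u))
    v_norm = np.sqrt(v.dot(v))
    return u.dot(v) / (u_norm * v_norm)

# Same direction, different lengths: the angle is 0, so the cosine is 1
same_direction = cosine(np.array([1.0, 2.0]), np.array([2.0, 4.0]))

# Perpendicular vectors: the angle is 90 degrees, so the cosine is 0
orthogonal = cosine(np.array([1.0, 0.0]), np.array([0.0, 3.0]))

# 45 degrees apart: the cosine is 1 / sqrt(2), about 0.7071
diagonal = cosine(np.array([1.0, 1.0]), np.array([1.0, 0.0]))
```

Because both vectors are divided by their lengths, only direction matters: scaling a vector does not change the result.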

Now let's use this function to compute the A->Q->A cosine similarity.

We will use the results from [our gpt-4o-mini evaluations](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/03-evaluation/rag_evaluation/data/results-gpt4o-mini.csv):

In [41]:
results_url = url_prefix + 'rag_evaluation/data/results-gpt4o-mini.csv'
df_results = pd.read_csv(results_url)
df_results.head()

|  | answer_llm | answer_orig | document | question | course |
|---|---|---|---|---|---|
| 0 | You can sign up for the course by visiting the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Where can I sign up for the course? | machine-learning-zoomcamp |
| 1 | You can sign up using the link provided in the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Can you provide a link to sign up? | machine-learning-zoomcamp |
| 2 | Yes, there is an FAQ for the Machine Learning ... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Is there an FAQ for this Machine Learning course? | machine-learning-zoomcamp |
| 3 | The context does not provide any specific info... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Does this course have a GitHub repository for ... | machine-learning-zoomcamp |
| 4 | To structure your questions and answers for th... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | How can I structure my questions and answers f... | machine-learning-zoomcamp |



When creating embeddings, we will use the same simple approach
as in the [Embeddings](#embeddings) section:

In [42]:
pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)

Let's fit the vectorizer on all the text data we have:

In [43]:
pipeline.fit(df_results.answer_llm + ' ' + df_results.answer_orig + ' ' + df_results.question)

Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer(min_df=3)), ('truncatedsvd', TruncatedSVD(n_components=128, random_state=1))])


What's the average cosine?

- 0.64
- 0.74
- 0.84
- 0.94

This is how you do it:

- For each answer pair, compute
    - `v_llm` for the answer from the LLM 
    - `v_orig` for the original answer
    - then compute the cosine between them
- At the end, take the average
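
The steps above can also be sketched in vectorized form, computing all cosines at once instead of pair by pair. This is a minimal illustration with hypothetical 2-D vectors standing in for `v_llm` and `v_orig`; the helper `rowwise_cosine` is an assumption for this sketch, not part of the homework code, and it assumes no row has zero length.

```python
import numpy as np

def rowwise_cosine(A, B):
    """Cosine similarity between matching rows of A and B."""
    dots = np.sum(A * B, axis=1)
    norms = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    return dots / norms

# Hypothetical embeddings: one row per answer pair
toy_v_llm = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
toy_v_orig = np.array([[1.0, 0.0], [1.0, 0.0], [3.0, 0.0]])

# Per-pair cosines: identical, 45 degrees apart, perpendicular
toy_cosines = rowwise_cosine(toy_v_llm, toy_v_orig)

# The quantity the question asks about is the mean over all pairs
toy_average = toy_cosines.mean()
```

The per-row loop used in the cells that follow gives the same numbers; the vectorized form mainly saves time on larger datasets.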

In [44]:
pipeline.transform([df_results['answer_llm'][0]])

array([[ 1.55498588e-01,  1.12196444e-01, -1.27448731e-01,
         7.29928733e-02, -8.37144913e-02,  7.12563053e-02,
        -4.17337192e-02, -8.01481645e-03, -2.48426718e-02,
        -1.96725163e-02, -9.15480506e-03, -2.68991351e-02,
        -4.35971673e-02,  2.90931663e-02, -2.36823437e-02,
        -4.99815087e-02,  6.65563175e-02,  8.34148810e-02,
        -3.29049137e-02,  4.33720994e-02,  1.95608402e-02,
         9.82027760e-03,  6.25639037e-02, -6.11572037e-02,
        -7.70446608e-02, -4.53851792e-02,  9.59588142e-02,
         2.33611751e-02, -3.75310787e-02, -1.20919087e-02,
        -4.42022930e-02, -2.80196178e-02,  4.82849319e-02,
         6.44100659e-02, -6.05386231e-02, -3.75276419e-02,
        -3.73247344e-02,  6.05745643e-02,  2.76423210e-02,
        -6.13524037e-02,  6.59253837e-02, -1.99575837e-02,
         2.05179913e-02,  6.58706921e-03,  7.37963169e-02,
         5.55383382e-03,  7.00592681e-02, -7.29858778e-02,
         2.34118930e-02,  6.07270187e-03, -1.30806947e-0

In [45]:
df_results['v_llm'] = df_results['answer_llm'].apply(lambda s: pipeline.transform([s]))

In [46]:
df_results['v_orig'] = df_results['answer_orig'].apply(lambda s: pipeline.transform([s]))

In [47]:
cosine(df_results['v_llm'][0].squeeze(), df_results['v_orig'][0].squeeze())

0.46352620160029967

In [48]:
df_results_dict = df_results.to_dict(orient='records')

In [49]:
for q in tqdm(df_results_dict):
    # print(q)
    q['cosine_v_llm_v_orig'] = cosine(q['v_llm'].squeeze(), q['v_orig'].squeeze())

  0%|          | 0/1830 [00:00<?, ?it/s]

In [50]:
df_results = pd.DataFrame.from_dict(df_results_dict)

In [51]:
df_results.head()

|  | answer_llm | answer_orig | document | question | course | v_llm | v_orig | cosine_v_llm_v_orig |
|---|---|---|---|---|---|---|---|---|
| 0 | You can sign up for the course by visiting the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Where can I sign up for the course? | machine-learning-zoomcamp | [[0.1554985879579983, 0.11219644369710972, -0.... | [[0.22746772878326757, 0.12079641681717065, -0... | 0.463526 |
| 1 | You can sign up using the link provided in the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Can you provide a link to sign up? | machine-learning-zoomcamp | [[0.1489427947945486, 0.17679213646212116, -0.... | [[0.22746772878326757, 0.12079641681717065, -0... | 0.781565 |
| 2 | Yes, there is an FAQ for the Machine Learning ... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Is there an FAQ for this Machine Learning course? | machine-learning-zoomcamp | [[0.26248740457158015, 0.14431317946524247, -0... | [[0.22746772878326757, 0.12079641681717065, -0... | 0.889158 |
| 3 | The context does not provide any specific info... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Does this course have a GitHub repository for ... | machine-learning-zoomcamp | [[0.2021609665075688, 0.08776325923283387, -0.... | [[0.22746772878326757, 0.12079641681717065, -0... | 0.614962 |
| 4 | To structure your questions and answers for th... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | How can I structure my questions and answers f... | machine-learning-zoomcamp | [[0.29736126396042195, -0.00020496794402309962... | [[0.22746772878326757, 0.12079641681717065, -0... | 0.624086 |


In [52]:
df_results['cosine_v_llm_v_orig'].describe()

count    1830.000000
mean        0.841584
std         0.173737
min         0.079093
25%         0.806927
50%         0.905812
75%         0.950711
max         0.996457
Name: cosine_v_llm_v_orig, dtype: float64

## Q6. ROUGE

An alternative way to see how similar two texts are is ROUGE.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than just cosine similarity alone.

We don't need to implement it ourselves; there's a Python package for it:

In [53]:
# !pip install rouge


(The latest version at the moment of writing is `1.0.1`)

Let's compute the ROUGE score between the answers at index 10 of our dataframe (`doc_id=5170565b`):

In [54]:
from rouge import Rouge
rouge_scorer = Rouge()

r = df_results.iloc[10]
scores = rouge_scorer.get_scores(r.answer_llm, r.answer_orig)[0]
scores

{'rouge-1': {'r': 0.45454545454545453,
  'p': 0.45454545454545453,
  'f': 0.45454544954545456},
 'rouge-2': {'r': 0.21621621621621623,
  'p': 0.21621621621621623,
  'f': 0.21621621121621637},
 'rouge-l': {'r': 0.3939393939393939,
  'p': 0.3939393939393939,
  'f': 0.393939388939394}}

There are three scores: `rouge-1`, `rouge-2` and `rouge-l`, and precision, recall and F1 score for each.

* `rouge-1` - the overlap of unigrams,
* `rouge-2` - bigrams,
* `rouge-l` - the longest common subsequence
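
To show what n-gram overlap means in practice, here is a simplified ROUGE-N sketch. It is an illustration only: the function `rouge_n_sketch` is hypothetical, it assumes lowercase whitespace tokenization and clipped n-gram counts, and its numbers may differ from those of the `rouge` package, which has its own tokenization and counting rules.

```python
from collections import Counter

def rouge_n_sketch(candidate, reference, n=1):
    """Simplified ROUGE-N: recall, precision and F1 over n-gram overlap."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Each shared n-gram counts as often as it appears in both texts
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"r": recall, "p": precision, "f": f1}

# 5 of the 6 unigrams match, so recall = precision = F1 = 5/6
toy_unigram = rouge_n_sketch("the cat lay on the mat", "the cat sat on the mat")

# 3 of the 5 bigrams match ("the cat", "on the", "the mat"): 0.6 each
toy_bigram = rouge_n_sketch("the cat lay on the mat", "the cat sat on the mat", n=2)
```

Changing a single word removes one unigram but two bigrams, which is why `rouge-2` is usually lower than `rouge-1` for the same pair of texts.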

For the document at index 10, the ROUGE-1 F1 score is 0.45.

Let's compute it for the pairs in the entire dataframe.
What's the average ROUGE-1 F1?

- 0.25
- 0.35
- 0.45
- 0.55


In [55]:
df_results_dict = df_results.to_dict(orient='records')

In [56]:
for r in tqdm(df_results_dict):
    scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
    r['rough_1_f'] = scores['rouge-1']['f']

  0%|          | 0/1830 [00:00<?, ?it/s]

In [57]:
df_results = pd.DataFrame.from_dict(df_results_dict)
df_results.head()

|  | answer_llm | answer_orig | document | question | course | v_llm | v_orig | cosine_v_llm_v_orig | rough_1_f |
|---|---|---|---|---|---|---|---|---|---|
| 0 | You can sign up for the course by visiting the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Where can I sign up for the course? | machine-learning-zoomcamp | [[0.1554985879579983, 0.11219644369710972, -0.... | [[0.22746772878326757, 0.12079641681717065, -0... | 0.463526 | 0.095238 |
| 1 | You can sign up using the link provided in the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Can you provide a link to sign up? | machine-learning-zoomcamp | [[0.1489427947945486, 0.17679213646212116, -0.... | [[0.22746772878326757, 0.12079641681717065, -0... | 0.781565 | 0.125 |
| 2 | Yes, there is an FAQ for the Machine Learning ... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Is there an FAQ for this Machine Learning course? | machine-learning-zoomcamp | [[0.26248740457158015, 0.14431317946524247, -0... | [[0.22746772878326757, 0.12079641681717065, -0... | 0.889158 | 0.415584 |
| 3 | The context does not provide any specific info... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Does this course have a GitHub repository for ... | machine-learning-zoomcamp | [[0.2021609665075688, 0.08776325923283387, -0.... | [[0.22746772878326757, 0.12079641681717065, -0... | 0.614962 | 0.216216 |
| 4 | To structure your questions and answers for th... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | How can I structure my questions and answers f... | machine-learning-zoomcamp | [[0.29736126396042195, -0.00020496794402309962... | [[0.22746772878326757, 0.12079641681717065, -0... | 0.624086 | 0.142076 |


In [58]:
df_results['rough_1_f'].describe()

count    1830.000000
mean        0.351695
std         0.158905
min         0.000000
25%         0.238887
50%         0.356300
75%         0.460133
max         0.950000
Name: rough_1_f, dtype: float64