## Module 3 | Homework: Search Evaluation

### Evaluation data

In [1]:
import requests
import pandas as pd

url_prefix = 'https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/03-evaluation/'
docs_url = url_prefix + 'search_evaluation/documents-with-ids.json'
documents = requests.get(docs_url).json()

ground_truth_url = url_prefix + 'search_evaluation/ground-truth-data.csv'
df_ground_truth = pd.read_csv(ground_truth_url)
ground_truth = df_ground_truth.to_dict(orient='records')

### Evaluation tools

In [2]:
from tqdm.auto import tqdm

def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank, is_relevant in enumerate(line):
            if is_relevant:
                # Only the first relevant result counts towards the reciprocal rank
                total_score = total_score + 1 / (rank + 1)
                break

    return total_score / len(relevance_total)

def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['document']
        results = search_function(q)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }
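
To make the two metrics concrete, here is a small worked example on a hand-made relevance matrix. It is only a sketch: `toy_relevance`, `toy_hit_rate`, and `toy_mrr` are illustrative names that are not part of the homework, and the functions simply mirror the definitions above (one row per query, one boolean per returned result).

```python
# Toy relevance matrix: one row per query, one boolean per returned result.
toy_relevance = [
    [False, True, False],   # first hit at rank 2 -> reciprocal rank 1/2
    [True, False, False],   # first hit at rank 1 -> reciprocal rank 1
    [False, False, False],  # no hit -> contributes 0
]


def toy_hit_rate(rows):
    # Fraction of queries with at least one relevant result
    return sum(any(row) for row in rows) / len(rows)


def toy_mrr(rows):
    # Mean of 1 / rank of the first relevant result (0 when there is none)
    total = 0.0
    for row in rows:
        for rank, is_relevant in enumerate(row, start=1):
            if is_relevant:
                total += 1 / rank
                break
    return total / len(rows)


toy_hit = toy_hit_rate(toy_relevance)  # 2 of 3 queries have a hit
toy_rr = toy_mrr(toy_relevance)        # (1/2 + 1 + 0) / 3
print(toy_hit, toy_rr)
```

Hit rate only asks whether the expected document appears anywhere in the results, while MRR also rewards placing it near the top, so MRR is never larger than hit rate.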


## Q1. Minsearch text
Now let's evaluate our usual minsearch approach, but tweak the parameters. Let's use the following boosting params:

```boost = {'question': 1.5, 'section': 0.1}```

What's the hit rate for this approach?

- 0.64
- 0.74
- 0.84
- 0.94

In [3]:
import minsearch

index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course", "id"]
)

index.fit(documents)

<minsearch.minsearch.Index at 0x7f2ee95ac1a0>

In [4]:
minsearch.__version__

'0.0.4'

In [5]:
def minsearch_search(query):
    return index.search(
        query=query["question"],
        filter_dict={'course': query['course']},
        boost_dict={'question': 1.5, 'section': 0.1},
        num_results=5
    )

In [6]:
evaluate(ground_truth, minsearch_search)

100%|██████████| 4627/4627 [00:35<00:00, 128.72it/s]


{'hit_rate': 0.848714069591528, 'mrr': 0.7288235717887772}

## Answer: 0.84

## Embeddings

In [5]:
from minsearch import VectorSearch
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

In [12]:
texts = []

for doc in documents:
    t = doc['question']
    texts.append(t)

pipeline_question = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)
X = pipeline_question.fit_transform(texts)

## Q2. Vector search for question
Now let's index these embeddings with minsearch and evaluate this search method.

What's the MRR for it?

- 0.25
- 0.35
- 0.45
- 0.55

In [13]:
vindex_q = VectorSearch(keyword_fields={'course'})
vindex_q.fit(X, documents)

<minsearch.vector.VectorSearch at 0x7fbae33f9310>

In [15]:
def vector_search(query, pipeline, vindex):
    return vindex.search(
        query_vector=pipeline.transform([query["question"]]),
        filter_dict={'course': query['course']},
        num_results=5
    )

In [16]:
evaluate(ground_truth, lambda q: vector_search(q, pipeline_question, vindex_q))

100%|██████████| 4627/4627 [00:24<00:00, 189.05it/s]


{'hit_rate': 0.48173762697212014, 'mrr': 0.3572833369353793}

## Answer: 0.35

## Q3. Vector search for question and answer
We only used the question in Q2. We can use both the question and the answer.

What's the hit rate?

- 0.62
- 0.72
- 0.82
- 0.92

In [17]:
texts = []

for doc in documents:
    t = doc['question'] + ' ' + doc['text']
    texts.append(t)

pipeline_qt = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)
X = pipeline_qt.fit_transform(texts)


vindex_qt = VectorSearch(keyword_fields={'course'})
vindex_qt.fit(X, documents)

<minsearch.vector.VectorSearch at 0x7fbad9100190>

In [19]:
evaluate(ground_truth, lambda q: vector_search(q, pipeline_qt, vindex_qt))

100%|██████████| 4627/4627 [00:42<00:00, 109.26it/s]


{'hit_rate': 0.8210503566025502, 'mrr': 0.6717347453353508}

## Answer: 0.82

## Q4. Qdrant
Now let's evaluate the following settings in Qdrant:

```python
text = doc['question'] + ' ' + doc['text']
model_handle = "jinaai/jina-embeddings-v2-small-en"
limit = 5
```

What's the MRR?

- 0.65
- 0.75
- 0.85
- 0.95


In [6]:
from qdrant_client import QdrantClient, models

client = QdrantClient(":memory:")

In [7]:
model = "jinaai/jina-embeddings-v2-small-en"
client.set_model(model)

Fetching 5 files: 100%|██████████| 5/5 [00:03<00:00,  1.54it/s]


In [8]:
collection_name = "FAQ"

client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=512,
        distance=models.Distance.COSINE
    )
)

True

In [9]:
client.create_payload_index(
    collection_name=collection_name,
    field_name="course",
    field_schema="keyword"
)


UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

In [10]:
points = []

# enumerate supplies sequential point ids without shadowing the built-in `id`
for point_id, doc in enumerate(documents):
  point = models.PointStruct(
    id=point_id,
    vector=models.Document(text=doc['question'] + ' ' + doc['text'], model=model),
    payload={
      "text": doc['text'],
      "section": doc['section'],
      "course": doc['course'],
      "question": doc['question'],
      "id": doc['id']
    }
  )
  points.append(point)

In [11]:
client.upsert(
    collection_name=collection_name,
    points=points
)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

In [12]:
def qdrant_search(query, limit=5):
  results = client.query_points(
    collection_name=collection_name,
    query=models.Document(
      text=query["question"],
      model=model
    ),
    limit=limit,
    with_payload=True,
    query_filter=models.Filter(
      must=[
        models.FieldCondition(
          key="course",
          match=models.MatchValue(value=query["course"])
        )
      ]
    )
  )
  points = results.points
  return [r.payload for r in points]

In [13]:
evaluate(ground_truth, qdrant_search)

100%|██████████| 4627/4627 [07:50<00:00,  9.84it/s]


{'hit_rate': 0.9299762264966501, 'mrr': 0.8517722066133576}

## Answer: 0.85

## Q5. Cosine similarity
In the second part of the module, we looked at evaluating the entire RAG approach. In particular, we looked at comparing the answer generated by our system with the actual answer from the FAQ.

What's the average cosine similarity?

- 0.64
- 0.74
- 0.84
- 0.94



In [15]:
import numpy as np

In [17]:
def cosine(u, v):
  u_norm = np.sqrt(u.dot(u))
  v_norm = np.sqrt(v.dot(v))
  return u.dot(v) / (u_norm * v_norm)
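
As a quick sanity check of this formula, the sketch below applies the same computation to a few hand-picked vectors whose cosine is known in advance. The name `cosine_check` and the sample vectors are illustrative only and are not part of the homework.

```python
import numpy as np


def cosine_check(u, v):
    # Same formula as above: dot product divided by the product of the norms
    u_norm = np.sqrt(u.dot(u))
    v_norm = np.sqrt(v.dot(v))
    return u.dot(v) / (u_norm * v_norm)


a = np.array([1.0, 0.0])

parallel = cosine_check(a, 2 * a)                   # same direction, different length
orthogonal = cosine_check(a, np.array([0.0, 1.0]))  # perpendicular vectors
diagonal = cosine_check(a, np.array([1.0, 1.0]))    # 45 degrees apart

print(parallel, orthogonal, diagonal)
```

The result depends only on the angle between the vectors, not on their lengths, which is why scaling `a` by 2 leaves the similarity unchanged.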

Now let's use this function to compute the A->Q->A cosine similarity.

We will use the results from our gpt-4o-mini evaluations:

In [18]:
results_url = url_prefix + 'rag_evaluation/data/results-gpt4o-mini.csv'
df_results = pd.read_csv(results_url)

In [19]:
pipeline = make_pipeline(
  TfidfVectorizer(min_df=3),
  TruncatedSVD(n_components=128, random_state=1)
)

In [22]:
results = df_results.to_dict(orient='records')

In [23]:
texts = []

for r in results:
  t = r['answer_llm'] + ' ' + r['answer_orig'] + ' ' + r['question']
  texts.append(t)

pipeline.fit(texts)

Pipeline(steps=[('tfidfvectorizer', ...), ('truncatedsvd', ...)])

In [24]:
def compute_similarity(record):
  answer_orig = record['answer_orig']
  answer_llm = record['answer_llm']

  v_llm, = pipeline.transform([answer_llm])
  v_orig, = pipeline.transform([answer_orig])

  return cosine(v_llm, v_orig)

In [25]:
similarities = []

for r in results:
  similarity = compute_similarity(r)
  similarities.append(similarity)

In [26]:
similarities = np.array(similarities)
similarities.mean().round(2)

np.float64(0.84)

## Answer: 0.84

## Q6. ROUGE
An alternative way to measure how similar two texts are is ROUGE.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than just cosine similarity alone.
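
To show what unigram overlap means in practice, here is a minimal sketch of ROUGE-1 precision, recall, and F1 computed by hand. It assumes plain whitespace tokenization and clipped unigram counts, which is the textbook definition; the `rouge` package used below may tokenize and count differently, so its numbers need not match this sketch exactly. The function name `rouge1_sketch` and the example sentences are illustrative only.

```python
from collections import Counter


def rouge1_sketch(hypothesis, reference):
    # Count unigrams in each text (assumption: lowercase + whitespace split)
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return {"r": 0.0, "p": 0.0, "f": 0.0}

    precision = overlap / sum(hyp_counts.values())  # share of hypothesis words matched
    recall = overlap / sum(ref_counts.values())     # share of reference words matched
    f1 = 2 * precision * recall / (precision + recall)
    return {"r": recall, "p": precision, "f": f1}


example = rouge1_sketch("the cat sat on the mat", "the cat lay on a mat")
print(example)
```

Here four unigrams overlap ("the", "cat", "on", "mat") out of six words on each side, so precision, recall, and F1 all come out to 4/6.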

In [27]:
%pip install rouge

Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl.metadata (4.1 kB)
Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1
Note: you may need to restart the kernel to use updated packages.


In [28]:
import rouge
rouge.__version__

'1.0.1'

Let's compute it for the pairs in the entire dataframe. What's the average ROUGE-1 F1?

- 0.25
- 0.35
- 0.45
- 0.55

In [29]:
# Compute ROUGE scores for the record at index 10
rouge_scorer = rouge.Rouge()

r = df_results.iloc[10]
scores = rouge_scorer.get_scores(r.answer_llm, r.answer_orig)[0]
scores

{'rouge-1': {'r': 0.45454545454545453,
  'p': 0.45454545454545453,
  'f': 0.45454544954545456},
 'rouge-2': {'r': 0.21621621621621623,
  'p': 0.21621621621621623,
  'f': 0.21621621121621637},
 'rouge-l': {'r': 0.3939393939393939,
  'p': 0.3939393939393939,
  'f': 0.393939388939394}}

In [30]:
f1 = []

for r in results:
  scores = rouge_scorer.get_scores(r["answer_llm"], r["answer_orig"])[0]
  r1_f1 = scores["rouge-1"]["f"]
  f1.append(r1_f1)


np.array(f1).mean().round(2)

np.float64(0.35)

## Answer: 0.35