## Homework: Evaluation and Monitoring

In this homework, we'll evaluate the quality of our RAG system.

## Data

We'll evaluate the quality of our RAG system using the answers generated with [gpt-4o-mini](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv), keeping only the first 300 documents.

In [2]:
import pandas as pd

github_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv'
url = f'{github_url}?raw=1'
df = pd.read_csv(url)
df = df.iloc[:300]
print(len(df))
print(df.head(3))

300
                                          answer_llm  \
0  You can sign up for the course by visiting the...   
1  You can sign up using the link provided in the...   
2  Yes, there is an FAQ for the Machine Learning ...   

                                         answer_orig  document  \
0  Machine Learning Zoomcamp FAQ\nThe purpose of ...  0227b872   
1  Machine Learning Zoomcamp FAQ\nThe purpose of ...  0227b872   
2  Machine Learning Zoomcamp FAQ\nThe purpose of ...  0227b872   

                                            question  \
0                Where can I sign up for the course?   
1                 Can you provide a link to sign up?   
2  Is there an FAQ for this Machine Learning course?   

                      course  
0  machine-learning-zoomcamp  
1  machine-learning-zoomcamp  
2  machine-learning-zoomcamp  


In [3]:
documents = df.to_dict(orient='records')
documents[0]

{'answer_llm': 'You can sign up for the course by visiting the course page at [http://mlzoomcamp.com/](http://mlzoomcamp.com/).',
 'answer_orig': 'Machine Learning Zoomcamp FAQ\nThe purpose of this document is to capture frequently asked technical questions.\nWe did this for our data engineering course and it worked quite well. Check this document for inspiration on how to structure your questions and answers:\nData Engineering Zoomcamp FAQ\nIn the course GitHub repository there’s a link. Here it is: https://airtable.com/shryxwLd0COOEaqXo\nwork',
 'document': '0227b872',
 'question': 'Where can I sign up for the course?',
 'course': 'machine-learning-zoomcamp'}

## Q1. Embeddings model

Get the embeddings model `multi-qa-mpnet-base-dot-v1` from
[the Sentence Transformer library](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview).

In [4]:
from sentence_transformers import SentenceTransformer

model_name = 'multi-qa-mpnet-base-dot-v1'
model = SentenceTransformer(model_name)

Create the embeddings for the first LLM answer.

What's the first value of the resulting vector?

* -0.42 <--
* -0.22
* -0.02
* 0.21

In [5]:
answer_llm = df.iloc[0].answer_llm
print("answer_llm:", answer_llm)
embedding_llm = model.encode(answer_llm)
print("embedding_llm:", embedding_llm[0])

answer_llm: You can sign up for the course by visiting the course page at [http://mlzoomcamp.com/](http://mlzoomcamp.com/).


embedding_llm: -0.42244655


## Q2. Computing the dot product

Now, for each answer pair, let's create embeddings and compute the dot product between them.

We will put the results (scores) into the `evaluations` list.

What's the 75th percentile of the scores?

* 21.67
* 31.67 <--
* 41.67
* 51.67
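
As a small illustration of what the 75th percentile means, here is a sketch on made-up toy scores (not the real evaluation results). It assumes NumPy's default behaviour of linear interpolation between neighbouring sorted values; `toy_scores` is a hypothetical name used only for this example.

```python
import numpy as np

# Toy scores for illustration only, not taken from the dataset
toy_scores = [1.0, 2.0, 3.0, 4.0]

# With the default linear interpolation, the 75th percentile sits at
# position 0.75 * (n - 1) = 2.25 in the sorted list, i.e. a quarter of
# the way between the values 3.0 and 4.0
p75 = np.percentile(toy_scores, 75)
print(p75)  # 3.25
```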

In [6]:
from tqdm.auto import tqdm

evaluations = []
for doc in tqdm(documents):
    answer_llm_embed = model.encode(doc['answer_llm'])
    answer_orig_embed = model.encode(doc['answer_orig'])
    dot_product = answer_llm_embed.dot(answer_orig_embed)
    evaluations.append(dot_product)

print(evaluations[:5])

[17.515987, 13.418402, 25.313255, 12.147415, 18.747736]


In [7]:
import numpy as np
print(np.percentile(evaluations, 75))

31.67430877685547


## Q3. Computing the cosine

From Q2, we can see that the results are not within the [0, 1] range. It's because the vectors coming from this model are not normalized.

So we need to normalize them.

To do this, we:

* Compute the norm of a vector
* Divide each element by this norm

So, for vector `v`, it'll be `v / ||v||`


In [8]:
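def normalize_vector(v):
    # Euclidean (L2) norm of the vector
    norm = np.sqrt((v * v).sum())
    # Return a zero vector unchanged so the result is still an array
    return v if norm == 0 else v / norm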

With the normalization wrapped in a function, we can compute the dot product
between the normalized vectors. This gives us the cosine similarity.

What's the 75th percentile of the cosine scores?
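
To see why normalizing first turns the dot product into cosine similarity, here is a minimal sketch on toy vectors rather than real model embeddings. The helper `cosine_similarity` and the vectors `u` and `v` are hypothetical names for this illustration; returning `0.0` for a zero vector is a convention chosen here, not something the homework specifies.

```python
import numpy as np

def cosine_similarity(u, v):
    """Dot product of the two vectors after scaling each to unit length."""
    u_norm = np.linalg.norm(u)
    v_norm = np.linalg.norm(v)
    if u_norm == 0 or v_norm == 0:
        return 0.0  # convention chosen for this sketch
    return float(np.dot(u / u_norm, v / v_norm))

# Toy vectors pointing in the same direction but with different lengths
u = np.array([3.0, 4.0])
v = np.array([6.0, 8.0])

raw_dot = float(np.dot(u, v))   # 50.0: grows with the vector lengths
cos = cosine_similarity(u, v)   # ~1.0: depends only on the direction
print(raw_dot, cos)
```

The raw dot product depends on how long the vectors are, which is why the scores in Q2 are so large, while the normalized version only reflects the angle between the vectors.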
* 0.63
* 0.73
* 0.83 <--
* 0.93

In [9]:
evaluations_norm = []
for doc in tqdm(documents):
    answer_llm_embed_norm = normalize_vector(model.encode(doc['answer_llm']))
    answer_orig_embed_norm = normalize_vector(model.encode(doc['answer_orig']))
    dot_product = answer_llm_embed_norm.dot(answer_orig_embed_norm)
    evaluations_norm.append(dot_product)

print(evaluations_norm[:5])

[0.5067539, 0.38854873, 0.7185989, 0.33726627, 0.5217923]


In [10]:
print(np.percentile(evaluations_norm, 75))

0.8362348973751068


## Q4. Rouge

Now we will explore an alternative metric: the ROUGE score.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than just cosine similarity alone.

We don't need to implement it ourselves; there's a Python package for it:

```bash
pip install rouge
```


In [None]:
!pip install rouge==1.0.1

Let's compute the ROUGE score between the answers at index 10 of our dataframe (`doc_id=5170565b`):

```python
from rouge import Rouge
rouge_scorer = Rouge()

# r is the record at index 10 of the dataframe
r = df.iloc[10]
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
```

There are three scores: `rouge-1`, `rouge-2` and `rouge-l`, and precision, recall and F1 score for each.

* `rouge-1` - the overlap of unigrams,
* `rouge-2` - bigrams,
* `rouge-l` - the longest common subsequence
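
To make the idea of n-gram overlap concrete, here is a simplified sketch of how precision, recall and F1 can be computed from shared n-grams. It is an illustration only, not the implementation used by the `rouge` package: tokenization here is a plain lowercase whitespace split, overlaps are counted with clipping, and the function names and example sentences are hypothetical. Its numbers should not be expected to match the package's output.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token sequences in the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap_scores(hypothesis, reference, n=1):
    """Precision, recall and F1 based on n-grams shared by both texts."""
    hyp = Counter(ngrams(hypothesis.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((hyp & ref).values())
    precision = overlap / max(sum(hyp.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {'p': precision, 'r': recall, 'f': f1}

# Toy sentences for illustration only
hyp_text = "the cat sat on the mat"
ref_text = "the cat lay on the mat"

unigram = ngram_overlap_scores(hyp_text, ref_text, n=1)  # 5 of 6 unigrams shared
bigram = ngram_overlap_scores(hyp_text, ref_text, n=2)   # 3 of 5 bigrams shared
print(unigram)
print(bigram)
```

Changing a single word lowers the bigram score more than the unigram score, because every bigram that contains the changed word stops matching.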

What's the F score for `rouge-1`?
- 0.35
- 0.45 <--
- 0.55
- 0.65

In [11]:
documents[10]

{'answer_llm': "Yes, all sessions are recorded, so if you miss one, you won't miss anything. You can catch up on the content later. Additionally, you can submit your questions in advance for office hours, and those sessions are also recorded.",
 'answer_orig': 'Everything is recorded, so you won’t miss anything. You will be able to ask your questions for office hours in advance and we will cover them during the live stream. Also, you can always ask questions in Slack.',
 'document': '5170565b',
 'question': 'Are sessions recorded if I miss one?',
 'course': 'machine-learning-zoomcamp'}

In [12]:
from rouge import Rouge
rouge_scorer = Rouge()

scores = rouge_scorer.get_scores(documents[10]['answer_llm'], documents[10]['answer_orig'])[0]
print(scores)


{'rouge-1': {'r': 0.45454545454545453, 'p': 0.45454545454545453, 'f': 0.45454544954545456}, 'rouge-2': {'r': 0.21621621621621623, 'p': 0.21621621621621623, 'f': 0.21621621121621637}, 'rouge-l': {'r': 0.3939393939393939, 'p': 0.3939393939393939, 'f': 0.393939388939394}}


## Q5. Average rouge score

Let's compute the average of the `rouge-1`, `rouge-2` and `rouge-l` F scores for the same record from Q4.

- 0.35 <--
- 0.45
- 0.55
- 0.65

In [13]:
avg = np.mean([scores['rouge-1']['f'], scores['rouge-2']['f'], scores['rouge-l']['f']])
print(avg)

0.35490034990035496


## Q6. Average rouge score for all the data points

Now let's compute the score for all the records

```
rouge_1 = scores['rouge-1']['f']
rouge_2 = scores['rouge-2']['f']
rouge_l = scores['rouge-l']['f']
```

Create a dataframe from them.

What's the average `rouge_2` across all the records?
- 0.10
- 0.20 <--
- 0.30
- 0.40

In [14]:
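def rouge_avg_score(scores):
    # Despite the name, this only extracts the three F1 scores;
    # the averaging is done later on the dataframe columns
    rouge_1 = scores['rouge-1']['f']
    rouge_2 = scores['rouge-2']['f']
    rouge_l = scores['rouge-l']['f']

    return {'rouge_1': rouge_1, 'rouge_2': rouge_2, 'rouge_l': rouge_l}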

In [15]:
evaluations_rouge = []
for doc in tqdm(documents):
    scores = rouge_scorer.get_scores(doc['answer_llm'], doc['answer_orig'])[0]
    avg_score = rouge_avg_score(scores)
    evaluations_rouge.append(avg_score)

print(evaluations_rouge[:5])

[{'rouge_1': 0.09523809178130524, 'rouge_2': 0.028169010918468917, 'rouge_l': 0.09523809178130524}, {'rouge_1': 0.12499999641113292, 'rouge_2': 0.05555555225694465, 'rouge_l': 0.09374999641113295}, {'rouge_1': 0.41558441095631643, 'rouge_2': 0.17777777313333343, 'rouge_l': 0.3896103849822905}, {'rouge_1': 0.2162162117421476, 'rouge_2': 0.047058819111419105, 'rouge_l': 0.18918918471512064}, {'rouge_1': 0.14207649881095297, 'rouge_2': 0.03389830142092829, 'rouge_l': 0.12021857531368524}]


In [16]:
df_rouge_score = pd.DataFrame(evaluations_rouge)
df_rouge_score.head()

    rouge_1   rouge_2   rouge_l
0  0.095238  0.028169  0.095238
1  0.125000  0.055556  0.093750
2  0.415584  0.177778  0.389610
3  0.216216  0.047059  0.189189
4  0.142076  0.033898  0.120219


In [17]:
df_rouge_score['rouge_2'].mean()

0.20696501983423318