## Homework: Evaluation and Monitoring

In this homework, we'll evaluate the quality of our RAG system.

> It's possible that your answers won't match exactly. If it's the case, select the closest one.

## Gather data

Let's start by getting the dataset. We will use the data we generated in the module.

In particular, we'll evaluate the quality of our RAG system
with [gpt-4o-mini](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv).


Read it:

```python
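import pandas as pd
github_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv'
url = f'{github_url}?raw=1'  # ?raw=1 makes GitHub serve the raw CSV instead of the HTML page
df = pd.read_csv(url)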
```

We will use only the first 300 documents:


```python
df = df.iloc[:300]
```

In [1]:
import pandas as pd

In [2]:
github_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv'
url = f'{github_url}?raw=1'
df = pd.read_csv(url)

In [4]:
df = df.iloc[:300]
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   answer_llm   300 non-null    object
 1   answer_orig  300 non-null    object
 2   document     300 non-null    object
 3   question     300 non-null    object
 4   course       300 non-null    object
dtypes: object(5)
memory usage: 11.8+ KB


## 1. Getting the embeddings model

Now, get the embeddings model `multi-qa-mpnet-base-dot-v1` from
[the Sentence Transformers library](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview).

> Note: this is not the same model as in HW3

```python
from sentence_transformers import SentenceTransformer
model_name = 'multi-qa-mpnet-base-dot-v1'
embedding_model = SentenceTransformer(model_name)
```

Create the embeddings for the first LLM answer:

```python
answer_llm = df.iloc[0].answer_llm
```

What's the first value of the resulting vector?

* -0.42
* -0.22
* -0.02
* 0.21

In [None]:
from sentence_transformers import SentenceTransformer
model_name = 'multi-qa-mpnet-base-dot-v1'
embedding_model = SentenceTransformer(model_name)

In [8]:
answer_llm = df.iloc[0].answer_llm
answer_llm_embeddings = embedding_model.encode(answer_llm)
answer_llm_embeddings[0]

-0.42244673

## 2. Computing the dot product


Now, for each answer pair, let's create embeddings and compute the dot product between them.

We will put the results (scores) into the `evaluations` list.
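The loop described above can be sketched on toy data without downloading the model. The vectors below are made-up stand-ins for what `embedding_model.encode(...)` would return (real embeddings are much longer), and `toy_pairs` and `p75` are names introduced only for this illustration.

```python
import numpy as np

# Hypothetical (llm_vector, orig_vector) pairs standing in for real embeddings.
toy_pairs = [
    (np.array([1.0, 2.0, 3.0]), np.array([1.0, 0.0, 1.0])),  # dot product = 4
    (np.array([0.5, 0.5, 0.0]), np.array([2.0, 2.0, 2.0])),  # dot product = 2
    (np.array([3.0, 0.0, 1.0]), np.array([2.0, 1.0, 0.0])),  # dot product = 6
    (np.array([1.0, 1.0, 1.0]), np.array([4.0, 2.0, 2.0])),  # dot product = 8
]

# One score per answer pair, collected in the same `evaluations` list.
evaluations = []
for v_llm, v_orig in toy_pairs:
    evaluations.append(float(v_llm.dot(v_orig)))

# By default np.percentile interpolates linearly between the sorted scores.
p75 = np.percentile(evaluations, 75)
print(evaluations, p75)
```

With the real data, the only change would be replacing the toy vectors with the encoded `answer_llm` and `answer_orig` of each row.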

What's the 75th percentile of the score?

* 21.67
* 31.67
* 41.67
* 51.67

In [11]:
from tqdm.notebook import tqdm

In [16]:
evaluations = list()
for item in tqdm(df.itertuples()):
    v = embedding_model.encode(item.answer_llm)
    w = embedding_model.encode(item.answer_orig)
    result = v.dot(w)
    evaluations.append(result)


In [19]:
import numpy as np

np.percentile(evaluations, 75).round(2)

31.67

## 3. Computing the cosine

From Q2, we can see that the results are not within the [0, 1] range. It's because the vectors coming from this model are not normalized.

So we need to normalize them.

To do it, we:

* Compute the norm of a vector
* Divide each element by this norm

So, for vector `v`, it'll be `v / ||v||`.

In numpy, this is how you do it:

```python
norm = np.sqrt((v * v).sum())
v_norm = v / norm
```

Let's put it into a function and then compute the dot product
between normalized vectors. This will give us the cosine similarity.
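One possible shape for such a function is sketched below, checked on two small hand-made vectors rather than real embeddings. The names `normalize` and `cosine` are chosen for this illustration, and the sketch assumes the vectors are non-zero.

```python
import numpy as np

def normalize(v):
    """Scale v to unit length: v / ||v||. Assumes v is not the zero vector."""
    norm = np.sqrt((v * v).sum())
    return v / norm

def cosine(v, w):
    """Dot product of the two normalized vectors, i.e. cosine similarity."""
    return float(normalize(v).dot(normalize(w)))

# Toy vectors, both of length 5, so the arithmetic is easy to verify by hand.
v = np.array([3.0, 4.0])
w = np.array([4.0, 3.0])

print(normalize(v))   # unit-length version of v
print(cosine(v, w))   # (3*4 + 4*3) / (5 * 5) = 0.96
```

This is equivalent to dividing the raw dot product by the product of the two norms, so either form should give the same score.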

What's the 75th percentile of the cosine similarity scores?

* 0.63
* 0.73
* 0.83
* 0.93

In [33]:
evaluations = list()
for item in tqdm(df.itertuples()):
    v = embedding_model.encode(item.answer_llm)
    w = embedding_model.encode(item.answer_orig)
    result = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
    evaluations.append(result)


In [36]:
np.percentile(evaluations, 75).round(2)

0.84

## 4. Rouge

Now we will explore an alternative metric: the ROUGE score.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than just cosine similarity alone.

We don't need to implement it ourselves; there's a Python package for it:

```bash
pip install rouge
```

(The latest version at the moment of writing is `1.0.1`)

Let's compute the ROUGE score between the answers at the index 10 of our dataframe (`doc_id=5170565b`)

```python
from rouge import Rouge
rouge_scorer = Rouge()

# Record at index 10 of the dataframe
r = df.iloc[10].to_dict()
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
```

There are three scores: `rouge-1`, `rouge-2` and `rouge-l`, and precision, recall and F1 score for each.

* `rouge-1` - the overlap of unigrams,
* `rouge-2` - bigrams,
* `rouge-l` - the longest common subsequence
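To make the idea of unigram overlap concrete, here is a simplified sketch of a `rouge-1`-style score on two toy sentences. It is not the `rouge` package's implementation: it assumes lowercase whitespace tokenization and clipped word counts, so its numbers need not match what `get_scores` returns. The function name `rouge_1_sketch` is hypothetical, and the keys `p`, `r` and `f` are chosen to echo the `'f'` key used above.

```python
from collections import Counter

def rouge_1_sketch(hypothesis, reference):
    """Unigram-overlap precision, recall and F1 (simplified illustration)."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return {"p": 0.0, "r": 0.0, "f": 0.0}
    p = overlap / sum(hyp.values())   # share of hypothesis words found in reference
    r = overlap / sum(ref.values())   # share of reference words found in hypothesis
    return {"p": p, "r": r, "f": 2 * p * r / (p + r)}

# Shared words: "the", "cat", "on", "mat" -> 4 of 6 words on each side.
toy = rouge_1_sketch("the cat sat on the mat", "the cat lay on a mat")
print(toy)
```

The same pattern applied to pairs of adjacent words would give a `rouge-2`-style score.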

What's the F score for `rouge-1`?

- 0.35
- 0.45
- 0.55
- 0.65

In [37]:
from rouge import Rouge
rouge_scorer = Rouge()

r = df.iloc[10].to_dict()
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]

In [38]:
round(scores['rouge-1']['f'], 2)

0.45

## 5. Average rouge score

Let's compute the average of the `rouge-1`, `rouge-2` and `rouge-l` F scores for the same record from Q4.

- 0.35
- 0.45
- 0.55
- 0.65



In [40]:
np.average([scores['rouge-1']['f'], scores['rouge-2']['f'], scores['rouge-l']['f']]).round(2)

0.35

## 6. Average rouge score for all the data points

Now let's compute the score for all the records:

```python
rouge_1 = scores['rouge-1']['f']
rouge_2 = scores['rouge-2']['f']
rouge_l = scores['rouge-l']['f']
rouge_avg = (rouge_1 + rouge_2 + rouge_l) / 3
```

And create a dataframe from them.
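As a rough sketch of the bookkeeping, the snippet below turns each score dictionary into a flat row and averages one column. The `toy_scores` values are invented and only mimic the nested shape used above (`scores['rouge-1']['f']`); `to_row` is a hypothetical helper name.

```python
from statistics import mean

def to_row(scores):
    """Flatten one ROUGE result into a row with the three F scores and their average."""
    rouge_1 = scores['rouge-1']['f']
    rouge_2 = scores['rouge-2']['f']
    rouge_l = scores['rouge-l']['f']
    return {
        'rouge_1': rouge_1,
        'rouge_2': rouge_2,
        'rouge_l': rouge_l,
        'rouge_avg': (rouge_1 + rouge_2 + rouge_l) / 3,
    }

# Made-up results standing in for one get_scores(...)[0] output per record.
toy_scores = [
    {'rouge-1': {'f': 0.6}, 'rouge-2': {'f': 0.3}, 'rouge-l': {'f': 0.3}},
    {'rouge-1': {'f': 0.4}, 'rouge-2': {'f': 0.1}, 'rouge-l': {'f': 0.4}},
]

rows = [to_row(s) for s in toy_scores]
mean_rouge_2 = mean(row['rouge_2'] for row in rows)
print(rows, mean_rouge_2)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`, after which the column mean is available as `df_rouge.rouge_2.mean()`.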

What's the average `rouge_2` across all the records?

- 0.10
- 0.20
- 0.30
- 0.40


In [42]:
rouge_1_scores = list()
rouge_2_scores = list()
rouge_l_scores = list()

for item in tqdm(df.itertuples()):
    scores = rouge_scorer.get_scores(item.answer_llm, item.answer_orig)[0]
    rouge_1_scores.append(scores['rouge-1']['f'])
    rouge_2_scores.append(scores['rouge-2']['f'])
    rouge_l_scores.append(scores['rouge-l']['f'])


In [46]:
df_rouge = pd.DataFrame({
    'rouge_1': rouge_1_scores,
    'rouge_2': rouge_2_scores,
    'rouge_l': rouge_l_scores
})
df_rouge.head()

    rouge_1   rouge_2   rouge_l
0  0.095238  0.028169  0.095238
1  0.125000  0.055556  0.093750
2  0.415584  0.177778  0.389610
3  0.216216  0.047059  0.189189
4  0.142076  0.033898  0.120219


In [54]:
df_rouge.rouge_2.describe()

count    300.000000
mean       0.206965
std        0.153550
min        0.000000
25%        0.097809
50%        0.178671
75%        0.286181
max        0.739130
Name: rouge_2, dtype: float64