## Homework: Evaluation and Monitoring

In this homework, we'll evaluate the quality of our RAG system.

> It's possible that your answers won't match exactly. If it's the case, select the closest one.

Solution:

* Video: TBA
* Notebook: TBA

## Getting the data

Let's start by getting the dataset. We will use the data we generated in the module.

In particular, we'll evaluate the quality of the answers our RAG system generated
with [gpt-4o-mini](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv).


Read it:

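```python
import pandas as pd

github_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv'
url = f'{github_url}?raw=1'
df = pd.read_csv(url)
```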

We will use only the first 300 documents:


```python
df = df.iloc[:300]
```

## Q1. Getting the embeddings model

Now, get the embedding model `multi-qa-mpnet-base-dot-v1` from
[the Sentence Transformer library](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview).

> Note: this is not the same model as in HW3

```python
from sentence_transformers import SentenceTransformer

model_name = 'multi-qa-mpnet-base-dot-v1'
embedding_model = SentenceTransformer(model_name)
```

Create the embeddings for the first LLM answer:

```python
answer_llm = df.iloc[0].answer_llm
```

What's the first value of the resulting vector?

* ***-0.42*** ✅
* -0.22
* -0.02
* 0.21

```python
import pandas as pd
import numpy as np

from sentence_transformers import SentenceTransformer

github_url = "https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv"
url = f'{github_url}?raw=1'
df = pd.read_csv(url)
df = df.iloc[:300]
```

```python
df.sample(5)
```

| | answer_llm | answer_orig | document | question | course |
|---|---|---|---|---|---|
| 72 | To be eligible for a certificate, you must sub... | Yes, you can. You won’t be able to submit some... | ee58a693 | How many course projects must I submit to be e... | machine-learning-zoomcamp |
| 281 | You can load data from a GitHub link into pand... | The dataset can be read directly to pandas dat... | 0b3eaf92 | What method allows me to load data from a GitH... | machine-learning-zoomcamp |
| 41 | To receive a certificate, you need to submit a... | Yes, if you finish at least 2 out of 3 project... | 2eba08e3 | What are the requirements to receive a certifi... | machine-learning-zoomcamp |
| 76 | Yes, you can start the course anytime. The cou... | The course is available in the self-paced mode... | 636f55d5 | Can I start the course anytime? | machine-learning-zoomcamp |
| 250 | You can ask questions for the Live Sessions fo... | Here are the crucial links for this Week 2 tha... | 50d737e7 | Where can I ask questions for the Live Session... | machine-learning-zoomcamp |


```python
model_name = 'multi-qa-mpnet-base-dot-v1'
embedding_model = SentenceTransformer(model_name)
```

```
You try to use a model that was created with version 3.0.0.dev0, however, your version is 2.7.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.
```

```python
df['vector_llm'] = df['answer_llm'].apply(lambda x: embedding_model.encode(x))
```

```python
df.iloc[0].answer_llm
```

```
'You can sign up for the course by visiting the course page at [http://mlzoomcamp.com/](http://mlzoomcamp.com/).'
```

```python
df.iloc[0].vector_llm[0]
```

```
-0.4224469
```


## Q2. Computing the dot product


Now, for each answer pair (`answer_llm` and `answer_orig`), let's create the embeddings and compute the dot product between them.

We will put the results (scores) into the `evaluations` list.
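
As a minimal sketch of this step on toy data, the per-pair dot products can be collected into an `evaluations` list and summarized with `np.percentile`. The vectors below are made up for illustration and are not output of the embedding model; the variable names `pairs` and `p75` are assumptions as well.

```python
import numpy as np

# Toy stand-ins for (LLM answer, original answer) embedding pairs.
# These numbers are illustrative only, not real model output.
pairs = [
    (np.array([1.0, 2.0]), np.array([3.0, 4.0])),  # dot = 11
    (np.array([0.0, 1.0]), np.array([5.0, 2.0])),  # dot = 2
    (np.array([2.0, 2.0]), np.array([1.0, 3.0])),  # dot = 8
    (np.array([1.0, 0.0]), np.array([4.0, 9.0])),  # dot = 4
    (np.array([3.0, 1.0]), np.array([2.0, 0.0])),  # dot = 6
]

# One score per answer pair
evaluations = [float(np.dot(v_llm, v_orig)) for v_llm, v_orig in pairs]

# 75th percentile with linear interpolation (NumPy's default)
p75 = np.percentile(evaluations, 75)

print(evaluations)  # [11.0, 2.0, 8.0, 4.0, 6.0]
print(p75)          # 8.0
```

With real embeddings the list would hold 300 scores, one per row of `df`.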

What's the 75th percentile of the scores?

* 21.67
* ***31.67*** ✅
* 41.67
* 51.67

```python
df['vector_orig'] = df['answer_orig'].apply(lambda x: embedding_model.encode(x))
```

```python
def calculate_dot_product(row: pd.Series) -> float:
    return np.dot(row['vector_llm'], row['vector_orig'])
```

```python
df['evaluation'] = df.apply(calculate_dot_product, axis=1)
```

```python
df['evaluation'].describe()
```

```
count    300.000000
mean      27.495996
std        6.384745
min        4.547921
25%       24.307847
50%       28.336866
75%       31.674306
max       39.476017
Name: evaluation, dtype: float64
```

## Q3. Computing the cosine

From Q2, we can see that the results are not within the [0, 1] range. It's because the vectors coming from this model are not normalized.

So we need to normalize them.

To do it, we:

* Compute the norm of a vector
* Divide each element by this norm

So, for vector `v`, it'll be `v / ||v||`

In numpy, this is how you do it:

```python
norm = np.sqrt((v * v).sum())
v_norm = v / norm
```

Let's put it into a function and then compute the dot product
between the normalized vectors. This will give us the cosine similarity.
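
A minimal sketch of that idea on toy vectors follows. The helper names `normalize` and `cosine` and the example vectors are assumptions made for illustration; they are not part of the homework code.

```python
import numpy as np


def normalize(v: np.ndarray) -> np.ndarray:
    """Scale v to unit length: v / ||v||."""
    norm = np.sqrt((v * v).sum())
    return v / norm


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity as the dot product of the normalized vectors."""
    return float(np.dot(normalize(u), normalize(v)))


# Toy vectors (illustrative only)
u = np.array([3.0, 4.0])
v = np.array([4.0, 3.0])

print(np.dot(u, v))  # 24.0, the raw dot product grows with vector length
print(cosine(u, v))  # approximately 0.96, i.e. 24 / (5 * 5)
```

This sketch does not guard against zero vectors, where the norm is 0 and the division is undefined.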

What's the 75th percentile of the cosine scores?

* 0.63
* 0.73
* ***0.83*** ✅
* 0.93

```python
def cosine_similarity(row: pd.Series) -> float:
    v_llm = row['vector_llm']
    norm = np.sqrt((v_llm * v_llm).sum())
    v_llm_norm = v_llm / norm

    v_orig = row['vector_orig']
    norm = np.sqrt((v_orig * v_orig).sum())
    v_orig_norm = v_orig / norm

    return np.dot(v_llm_norm, v_orig_norm)
```

```python
df['cosine'] = df.apply(cosine_similarity, axis=1)
```

```python
df['cosine'].describe()
```

```
count    300.000000
mean       0.728392
std        0.157755
min        0.125357
25%        0.651273
50%        0.763761
75%        0.836235
max        0.958796
Name: cosine, dtype: float64
```

## Q4. Rouge

Now we will explore an alternative metric: the ROUGE score.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than cosine similarity alone.

We don't need to implement it ourselves; there's a Python package for it:

```bash
pip install rouge
```

(The latest version at the moment of writing is `1.0.1`)

Let's compute the ROUGE score between the answers at index 10 of our dataframe (`doc_id=5170565b`):

```python
from rouge import Rouge
rouge_scorer = Rouge()

# The record at index 10 (doc_id=5170565b)
r = df.iloc[10]
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
```

There are three scores: `rouge-1`, `rouge-2` and `rouge-l`, and precision, recall and F1 score for each.

* `rouge-1` - the overlap of unigrams,
* `rouge-2` - bigrams,
* `rouge-l` - the longest common subsequence
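
To make the unigram-overlap idea concrete, here is a simplified sketch of a ROUGE-1 style score in plain Python. It assumes lowercased whitespace tokens and clipped counts; the function name `rouge_1_sketch` is hypothetical, and the `rouge` package may tokenize and count differently, so its numbers will not necessarily match.

```python
from collections import Counter


def rouge_1_sketch(hypothesis: str, reference: str) -> dict:
    """Simplified ROUGE-1: precision, recall and F1 over shared unigrams."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Counter & Counter keeps the minimum count of each shared token
    overlap = sum((hyp_counts & ref_counts).values())

    precision = overlap / max(sum(hyp_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if overlap == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {'p': precision, 'r': recall, 'f': f1}


# Toy sentences (illustrative only)
toy = rouge_1_sketch('the cat sat', 'the cat sat on the mat')
print(toy['p'], toy['r'])  # 1.0 0.5
print(round(toy['f'], 2))  # 0.67
```

All three hypothesis tokens appear in the reference (precision 1.0), but they cover only three of its six tokens (recall 0.5).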

What's the F score for `rouge-1`?

- 0.35 
- ***0.45*** ✅
- 0.55
- 0.65

```python
# %pip install rouge==1.0.1
```

```python
from rouge import Rouge
rouge_scorer = Rouge()

r = df.iloc[10]
r['document']
```

```
'5170565b'
```

```python
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
round(scores['rouge-1']['f'], 2)
```

```
0.45
```

## Q5. Average rouge score

Let's compute the average of the `rouge-1`, `rouge-2` and `rouge-l` F scores for the same record from Q4.

- ***0.35*** ✅
- 0.45
- 0.55
- 0.65

```python
# F scores of rouge-1, rouge-2 and rouge-l for the record from Q4
f_scores = [v['f'] for v in scores.values()]
round(np.mean(f_scores), 2)
```

```
0.35
```

## Q6. Average rouge score for all the data points

Now let's compute the scores for all the records:

```python
rouge_1 = scores['rouge-1']['f']
rouge_2 = scores['rouge-2']['f']
rouge_l = scores['rouge-l']['f']
rouge_avg = (rouge_1 + rouge_2 + rouge_l) / 3
```

Then create a dataframe from them.
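
One way to do this, sketched on made-up numbers, is to collect one dict per record and pass the list to `pd.DataFrame`. The `all_scores` list below is a hypothetical stand-in for the per-record output of `rouge_scorer.get_scores(...)[0]`, reduced to the `f` values; the names `rows` and `df_rouge` are assumptions too.

```python
import pandas as pd

# Hypothetical per-record scores with the same nesting as `scores` above.
# The numbers are made up for illustration.
all_scores = [
    {'rouge-1': {'f': 0.5}, 'rouge-2': {'f': 0.25}, 'rouge-l': {'f': 0.5}},
    {'rouge-1': {'f': 0.25}, 'rouge-2': {'f': 0.125}, 'rouge-l': {'f': 0.25}},
]

rows = []
for scores in all_scores:
    rouge_1 = scores['rouge-1']['f']
    rouge_2 = scores['rouge-2']['f']
    rouge_l = scores['rouge-l']['f']
    rows.append({
        'rouge_1': rouge_1,
        'rouge_2': rouge_2,
        'rouge_l': rouge_l,
        'rouge_avg': (rouge_1 + rouge_2 + rouge_l) / 3,
    })

df_rouge = pd.DataFrame(rows)

# Per-column averages across all records
print(df_rouge['rouge_2'].mean())  # 0.1875
```

Keeping each metric in its own column makes it possible to average `rouge_2` separately from `rouge_avg`.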

What's the average `rouge_2` across all the records?

- 0.10
- 0.20
- ***0.30*** ✅
- 0.40

```python
def calculate_rouge(row: pd.Series) -> float:
    scores = rouge_scorer.get_scores(row['answer_llm'], row['answer_orig'])[0]
    rouge_1 = scores['rouge-1']['f']
    rouge_2 = scores['rouge-2']['f']
    rouge_l = scores['rouge-l']['f']
    rouge_avg = (rouge_1 + rouge_2 + rouge_l) / 3
    return rouge_avg
```

```python
# Keep one score per row; take the mean separately
df['rouge'] = df.apply(calculate_rouge, axis=1)
```

Note that `calculate_rouge` returns `rouge_avg`, so the value below is the mean of the averaged score across all records, not the mean of `rouge_2` alone.

```python
round(df['rouge'].mean(), 2)
```

```
0.31
```