### Homework: Evaluation and Monitoring

In this homework, we'll evaluate the quality of our RAG system.

### Getting the data

Let's start by getting the dataset. We will use the data we generated in the module.

In particular, we'll evaluate the quality of our RAG system with [gpt-4o-mini](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv).

Read it:

In [1]:
import pandas as pd
import numpy as np
import os

In [3]:
github_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv'
url = f'{github_url}?raw=1'
df = pd.read_csv(url)

We will use only the first 300 documents:

In [4]:
df = df.iloc[:300]
df.head()

| Unnamed: 0 | answer_llm | answer_orig | document | question | course |
|---|---|---|---|---|---|
| 0 | You can sign up for the course by visiting the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Where can I sign up for the course? | machine-learning-zoomcamp |
| 1 | You can sign up using the link provided in the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Can you provide a link to sign up? | machine-learning-zoomcamp |
| 2 | Yes, there is an FAQ for the Machine Learning ... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Is there an FAQ for this Machine Learning course? | machine-learning-zoomcamp |
| 3 | The context does not provide any specific info... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Does this course have a GitHub repository for ... | machine-learning-zoomcamp |
| 4 | To structure your questions and answers for th... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | How can I structure my questions and answers f... | machine-learning-zoomcamp |


### Q1. Getting the embeddings model

Now, get the embedding model `multi-qa-mpnet-base-dot-v1` from the [Sentence Transformers library](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview):

In [5]:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer('multi-qa-mpnet-base-dot-v1')

Create the embeddings for the first LLM answer:

In [6]:
answer_llm = df.iloc[0].answer_llm

In [7]:
answer_llm

'You can sign up for the course by visiting the course page at http://mlzoomcamp.com/.'

In [8]:
query_embedding = embedding_model.encode(answer_llm)
query_embedding

array([-4.22446549e-01, -2.24856257e-01, -3.24058414e-01, -2.84758478e-01,
        7.25642918e-03,  1.01186566e-01,  1.03716910e-01, -1.89983174e-01,
       -2.80599259e-02,  2.71588802e-01, -1.15337655e-01,  1.14666030e-01,
       -8.49586725e-02,  3.32365334e-01,  5.52720726e-02, -2.22195774e-01,
       -1.42540857e-01,  1.02519155e-01, -1.52333647e-01, -2.02912465e-01,
        1.98422875e-02,  8.38149190e-02, -5.68632066e-01,  2.32844148e-02,
       -1.67292684e-01, -2.39256918e-01, -8.05464387e-02,  2.57084146e-02,
       -8.15464780e-02, -7.39290118e-02, -2.61550009e-01,  1.92575473e-02,
        3.22909206e-01,  1.90357104e-01, -9.34726413e-05, -2.13165611e-01,
        2.88943425e-02, -1.79530401e-02, -5.92756271e-02,  1.99918285e-01,
       -4.75170948e-02,  1.71634093e-01, -2.45917086e-02, -9.38061550e-02,
       -3.57002735e-01,  1.33263692e-01,  1.94045901e-01, -1.18530318e-01,
        4.56915230e-01,  1.47728190e-01,  3.35945129e-01, -1.86959356e-01,
        2.45954901e-01, -

What's the first value of the resulting vector?

* -0.42
* -0.22
* -0.02
*  0.21

### A1. The first value of the embedding vector is -0.42.

### Q2. Computing the dot product

Now, for each answer pair, let's create the embeddings and compute the dot product between them.

We will put the results (scores) into the `evaluations` list:

In [11]:
evaluations = []

for index, row in df.iterrows():
    answer_llm_embeddings = embedding_model.encode(row['answer_llm'])
    answer_orig_embeddings = embedding_model.encode(row['answer_orig'])
    
    # Compute dot product
    dot_product = np.dot(answer_llm_embeddings, answer_orig_embeddings)
    evaluations.append(dot_product)

df['evaluations'] = evaluations
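
The loop above computes one dot product per answer pair. As a minimal sketch, assuming the embeddings for all rows have already been stacked into two arrays of the same shape, the same scores can be computed in one vectorized step. The arrays below are toy stand-ins (hypothetical values), not real model output:

```python
import numpy as np

# Toy stand-ins for two stacks of embeddings: 3 answer pairs, 4 dimensions each.
# In the homework, each row would be the embedding of one answer (assumption).
llm_vectors = np.array([[1.0, 2.0, 0.0, 1.0],
                        [0.5, 0.5, 0.5, 0.5],
                        [2.0, 0.0, 1.0, 0.0]])
orig_vectors = np.array([[1.0, 1.0, 1.0, 1.0],
                         [2.0, 2.0, 2.0, 2.0],
                         [0.0, 3.0, 0.0, 3.0]])

# Multiply element-wise and sum along each row: one dot product per pair.
pairwise_dots = (llm_vectors * orig_vectors).sum(axis=1)

# The same scores computed pair by pair, mirroring the loop in the notebook.
loop_dots = np.array([np.dot(a, b) for a, b in zip(llm_vectors, orig_vectors)])

print(pairwise_dots)  # [4. 4. 0.]
```

Both approaches give identical scores; the vectorized form mainly helps when the number of rows grows.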

In [12]:
df['evaluations'].describe()

count    300.000000
mean      27.495996
std        6.384742
min        4.547924
25%       24.307844
50%       28.336870
75%       31.674309
max       39.476013
Name: evaluations, dtype: float64
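
The `75%` row of the `describe()` output is the 75th percentile. As a small sketch on a hypothetical toy series (not the homework data), the same value can be read directly with `Series.quantile`:

```python
import pandas as pd

# Hypothetical toy scores, used only to illustrate the two ways of reading a percentile.
toy_scores = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

p75_from_describe = toy_scores.describe()['75%']
p75_from_quantile = toy_scores.quantile(0.75)

print(p75_from_describe, p75_from_quantile)  # 40.0 40.0
```

On the homework dataframe, the equivalent call would presumably be `df['evaluations'].quantile(0.75)`.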

What's the 75th percentile of the score?

* 21.67
* 31.67
* 41.67
* 51.67

### A2. The 75th percentile of the score is 31.67.

### Q3. Computing the cosine

From Q2, we can see that the results are not within the [0, 1] range. It's because the vectors coming from this model are not normalized.

So we need to normalize them.

To do it, we

* Compute the norm of a vector
* Divide each element by this norm

So, for a vector `v`, it'll be `v / ||v||`.

In numpy, this is how you do it:

In [None]:
norm = np.sqrt((v * v).sum())
v_norm = v / norm

Let's put it into a function and then compute dot product between normalized vectors. This will give us cosine similarity.
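
To see why this works, here is a small worked example on two hand-picked 2D vectors (illustrative values, not embeddings). The raw dot product depends on the vector lengths, while the dot product of the normalized vectors equals `u · v / (||u|| * ||v||)`, which is the cosine of the angle between them:

```python
import numpy as np

# Two hand-picked vectors, each with norm 5 (illustrative values only).
u = np.array([3.0, 4.0])
v = np.array([4.0, 3.0])

# Normalize: divide each vector by its Euclidean norm.
u_norm = u / np.linalg.norm(u)  # [0.6, 0.8]
v_norm = v / np.linalg.norm(v)  # [0.8, 0.6]

raw_dot = np.dot(u, v)            # 24.0, scales with the vector lengths
cosine = np.dot(u_norm, v_norm)   # 0.96, independent of the vector lengths

print(raw_dot, round(cosine, 2))
```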

In [13]:
def normalize_vector(vector):
    """
    Normalize a vector by dividing each element by its norm (the magnitude of the vector).

    Parameters:
    vector (np.array): Input vector.

    Returns:
    np.array: Normalized vector.
    """
    norm = np.linalg.norm(vector)
    return vector / norm if norm != 0 else vector

In [14]:
def cosine_similarity(vector1, vector2):
    """
    Compute the cosine similarity between two vectors.

    Parameters:
    vector1 (np.array): First input vector.
    vector2 (np.array): Second input vector.

    Returns:
    float: Cosine similarity between the two vectors.
    """
    normalized_vector1 = normalize_vector(vector1)
    normalized_vector2 = normalize_vector(vector2)
    return np.dot(normalized_vector1, normalized_vector2)

In [15]:
cosine_values = []

for index, row in df.iterrows():
    answer_llm_embeddings = embedding_model.encode(row['answer_llm'])
    answer_orig_embeddings = embedding_model.encode(row['answer_orig'])
    
    # Compute cosine similarity
    cosine_sim = cosine_similarity(answer_llm_embeddings, answer_orig_embeddings)
    cosine_values.append(cosine_sim)

df['cosine'] = cosine_values

In [16]:
df['cosine'].describe()

count    300.000000
mean       0.728393
std        0.157755
min        0.125357
25%        0.651273
50%        0.763761
75%        0.836235
max        0.958796
Name: cosine, dtype: float64

What's the 75th percentile of the cosine scores?

*   0.63
*   0.73
*   0.83
*   0.93

### A3. The 75th percentile of the cosine scores is 0.83.

### Q4. Rouge

Now we will explore an alternative metric - the ROUGE score.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than just cosine similarity alone.
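
To make the idea of n-gram overlap concrete, here is a simplified sketch of a ROUGE-1-style score on whitespace tokens. The function name `unigram_overlap_scores` is hypothetical, and the tokenization and counting rules are assumptions for illustration; real ROUGE implementations may differ in these details and produce different numbers.

```python
from collections import Counter


def unigram_overlap_scores(candidate, reference):
    """Illustrative ROUGE-1-style precision, recall and F1 on whitespace tokens."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((cand_counts & ref_counts).values())

    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {'p': precision, 'r': recall, 'f': f1}


example = unigram_overlap_scores('the cat is on the mat', 'the cat sat on the mat')
print(example)
```

In this toy pair, five of the six tokens overlap, so precision, recall and F1 are all 5/6.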

We don't need to implement it ourselves; there's a Python package for it:

In [18]:
!pip install --upgrade pip
!pip install rouge


(The latest version at the moment of writing is 1.0.1)

[ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)) stands for Recall-Oriented Understudy for Gisting Evaluation.

Let's compute the [ROUGE score](https://huggingface.co/spaces/evaluate-metric/rouge) between the answers at index 10 of our dataframe (doc_id=5170565b).

Inputs (as described for the Hugging Face ROUGE metric linked above; the `rouge` package used below has its own interface):

* __predictions__ (list): list of predictions to score. Each prediction should be a string with tokens separated by spaces.
* __references__ (list or list[list]): list of references for each prediction, or a list of several references per prediction. Each reference should be a string with tokens separated by spaces.
* __rouge_types__ (list): a list of ROUGE types to calculate. Defaults to ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']. Valid ROUGE types:
    * "rouge1": unigram (1-gram) based scoring
    * "rouge2": bigram (2-gram) based scoring
    * "rougeL": longest common subsequence based scoring
    * "rougeLsum": splits text using "\n"
* __use_aggregator__ (boolean): If True, returns aggregates. Defaults to True.
* __use_stemmer__ (boolean): If True, uses Porter stemmer to strip word suffixes. Defaults to False.

Note: "f" stands for f1_score, "p" stands for precision, "r" stands for recall.

In [26]:
from rouge import Rouge

rouge_scorer = Rouge()
# Score only the record at index 10 instead of the whole dataframe
record = df.iloc[10]
scores = rouge_scorer.get_scores(record['answer_llm'], record['answer_orig'])[0]

In [27]:
scores

{'rouge-1': {'r': 0.45454545454545453,
  'p': 0.45454545454545453,
  'f': 0.45454544954545456},
 'rouge-2': {'r': 0.21621621621621623,
  'p': 0.21621621621621623,
  'f': 0.21621621121621637},
 'rouge-l': {'r': 0.3939393939393939,
  'p': 0.3939393939393939,
  'f': 0.393939388939394}}

There are three scores: rouge-1, rouge-2 and rouge-l, and precision, recall and F1 score for each.

*    rouge-1 - the overlap of unigrams,
*    rouge-2 - bigrams,
*    rouge-l - the longest common subsequence
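
The F score combines precision and recall as their harmonic mean, `f = 2 * p * r / (p + r)`. As a quick check, plugging in the rouge-1 precision and recall from the output above reproduces the reported F score up to a tiny difference in the trailing digits (the cause of that difference is not shown in the output, so it is not asserted here):

```python
# Precision and recall for rouge-1, copied from the scores printed above.
precision = 0.45454545454545453
recall = 0.45454545454545453

# Harmonic mean of precision and recall (standard F1 definition).
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 2))  # 0.45
```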

What's the F score for rouge-1?

*    0.35
*    0.45
*    0.55
*    0.65

### A4. The F score for rouge-1 between the answers at index 10 of our dataframe (doc_id=5170565b) is 0.45.

### Q5. Average rouge score

Let's compute the average of the rouge-1, rouge-2 and rouge-l F scores for the same record from Q4:

In [34]:
rouge_avg = (scores['rouge-1']['f'] + scores['rouge-2']['f'] + scores['rouge-l']['f'])/3
rouge_avg

0.35490034990035496

*    0.35
*    0.45
*    0.55
*    0.65

### A5. The average of the rouge-1, rouge-2, and rouge-l F scores for the answers at index 10 of our dataframe (doc_id=5170565b) is 0.35.

### Q6. Average rouge score for all the data points

Now let's compute the scores for all the records:

```
rouge_1 = scores['rouge-1']['f']
rouge_2 = scores['rouge-2']['f']
rouge_l = scores['rouge-l']['f']
rouge_avg = (rouge_1 + rouge_2 + rouge_l) / 3
```
And create a dataframe from them:

In [41]:
rouge_scorer = Rouge()

rouge_1_scores = []
rouge_2_scores = []
rouge_l_scores = []
rouge_avg_scores = []

for index, row in df.iterrows():

    scores = rouge_scorer.get_scores(row['answer_llm'], row['answer_orig'], avg=True)
    
    rouge_1 = scores['rouge-1']['f']
    rouge_2 = scores['rouge-2']['f']
    rouge_l = scores['rouge-l']['f']
    rouge_avg = (rouge_1 + rouge_2 + rouge_l) / 3

    
    rouge_1_scores.append(rouge_1)
    rouge_2_scores.append(rouge_2)
    rouge_l_scores.append(rouge_l)
    rouge_avg_scores.append(rouge_avg)
    

df['rouge_1'] = rouge_1_scores
df['rouge_2'] = rouge_2_scores
df['rouge_l'] = rouge_l_scores
df['rouge_avg'] = rouge_avg_scores

In [42]:
avg_rouge_2 = df['rouge_2'].mean()
avg_rouge_2

0.20696501983423318

What's the average rouge_2 across all the records?

*    0.10
*    0.20
*    0.30
*    0.40

### A6. The average rouge_2 across all records is 0.20.