# Module 4 Homework

## Homework: Evaluation and Monitoring

In this homework, we'll evaluate the quality of our RAG system.

> Your answers may not match exactly. If that's the case, select the closest one.

Solution:

* Video: TBA
* Notebook: TBA

## Getting the data

Let's start by getting the dataset. We will use the data we generated in the module.

In particular, we'll evaluate the quality of our RAG system
with [gpt-4o-mini](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv).


Read it:

In [1]:
# Importing pandas and defining `github_url` variable
import pandas as pd

github_url: str = "https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv"

In [2]:
url = f'{github_url}?raw=1'
df = pd.read_csv(url)

We will use only the first 300 documents:

In [3]:
df = df.iloc[:300]

## Q1. Getting the embeddings model

Now, get the embeddings model `multi-qa-mpnet-base-dot-v1` from
[the Sentence Transformers library](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview).

> Note: this is not the same model as in HW3

In [4]:
# Added code block defining `model_name` variable
model_name: str = "multi-qa-mpnet-base-dot-v1"

In [5]:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer(model_name)



Create the embeddings for the first LLM answer:

In [6]:
answer_llm = df.iloc[0].answer_llm

What's the first value of the resulting vector?

* -0.42
* -0.22
* -0.02
* 0.21

### Notes:

:white_check_mark: -0.42

In [7]:
type Embedding = "list[Tensor] | ndarray | Tensor"
v: Embedding = embedding_model.encode(answer_llm)
print("Q1 Answer: The first value of the resulting vector is {:.2f}".format(v[0]))

Q1 Answer: The first value of the resulting vector is -0.42


## Q2. Computing the dot product


Now, for each answer pair, let's create embeddings and compute the dot product between them.

We will put the results (scores) into the `evaluations` list.

What's the 75th percentile of the score?

* 21.67
* 31.67
* 41.67
* 51.67

### Notes:

:white_check_mark: 31.67

In [8]:
from typing import TypedDict


class AnswerRecord(TypedDict):
    answer_orig: str
    answer_llm: str


def compute_similarity(
    record: AnswerRecord,
    model: "SentenceTransformer"
    ) -> float:
    answer_orig = record['answer_orig']
    answer_llm = record['answer_llm']
    
    v_llm = model.encode(answer_llm)
    v_orig = model.encode(answer_orig)
    
    # The dot product of two 1-D vectors is a scalar score, not an embedding.
    return float(v_llm.dot(v_orig))

In [9]:
df_records: list[dict] = df.to_dict(orient="records")

In [10]:
from tqdm.auto import tqdm


evaluations: list[float] = []
for record in tqdm(df_records):
    evaluations.append(compute_similarity(record, embedding_model))


In [11]:
import statistics


print(pd.Series(evaluations).describe())

quantiles: list[float] = statistics.quantiles(evaluations, n=4, method="inclusive")
print("Q2 Answer: The 75% percentile of the score is {:.2f}".format(quantiles[2]))

count    300.000000
mean      27.495996
std        6.384745
min        4.547922
25%       24.307833
50%       28.336869
75%       31.674322
max       39.476021
dtype: float64
Q2 Answer: The 75% percentile of the score is 31.67
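
The `method="inclusive"` argument is what makes `statistics.quantiles` agree with the `75%` row printed by `describe()`, which interpolates linearly between data points by default. A minimal sketch on made-up numbers (the `toy_scores` list is purely illustrative, not taken from the dataset) shows how the two methods differ:

```python
import statistics

# Hypothetical scores, chosen so the arithmetic is easy to follow.
toy_scores = [10.0, 20.0, 30.0, 40.0]

# "inclusive" treats the sample as containing the population's min and max
# and interpolates linearly between neighbouring data points.
inclusive = statistics.quantiles(toy_scores, n=4, method="inclusive")

# "exclusive" (the default) assumes more extreme values may lie outside the sample.
exclusive = statistics.quantiles(toy_scores, n=4, method="exclusive")

print(inclusive)  # [17.5, 25.0, 32.5]
print(exclusive)  # [12.5, 25.0, 37.5]
```

The third element of each list is the 75th percentile, so leaving out `method="inclusive"` would shift the answer.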


## Q3. Computing the cosine

From Q2, we can see that the results are not within the [0, 1] range. It's because the vectors coming from this model are not normalized.

So we need to normalize them.

To do it, we:

* Compute the norm of a vector
* Divide each element by this norm

So, for vector `v`, it'll be `v / ||v||`

In numpy, this is how you do it:

```python
norm = np.sqrt((v * v).sum())
v_norm = v / norm
```

Let's put it into a function and then compute the dot product
between normalized vectors. This will give us the cosine similarity.
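
To see why normalization matters, here is a small sketch on two toy vectors. The `normalize` helper and the vectors `a` and `b` are hypothetical stand-ins for model embeddings, chosen so the result can be checked by hand:

```python
import numpy as np


def normalize(v: np.ndarray) -> np.ndarray:
    """Scale `v` to unit length: v / ||v||."""
    norm = np.sqrt((v * v).sum())
    return v / norm


# Toy vectors (hypothetical), both of length 5.
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

raw_dot = a.dot(b)                       # 24.0, grows with vector length
cosine = normalize(a).dot(normalize(b))  # 0.96 up to rounding, independent of length

print(raw_dot, round(float(cosine), 2))
```

Scaling `a` or `b` by any positive factor changes `raw_dot` but leaves `cosine` unchanged, which is why the cosine scores are comparable across answers of different lengths.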

What's the 75th percentile of the cosine scores?

* 0.63
* 0.73
* 0.83
* 0.93

### Notes:

:white_check_mark: 0.83

In [12]:
from typing import TypedDict

import numpy as np


class AnswerRecord(TypedDict):
    answer_orig: str
    answer_llm: str


def compute_cosine_similarity(
    record: AnswerRecord,
    model: "SentenceTransformer"
    ) -> float:
    answer_orig = record['answer_orig']
    answer_llm = record['answer_llm']
    
    v_llm = model.encode(answer_llm)
    v_orig = model.encode(answer_orig)

    # Divide each vector by its L2 norm so the dot product becomes the cosine.
    v_llm_norm = v_llm / np.sqrt((v_llm ** 2).sum())
    v_orig_norm = v_orig / np.sqrt((v_orig ** 2).sum())
    
    return float(v_llm_norm.dot(v_orig_norm))

In [13]:
from tqdm.auto import tqdm


evaluations_cosine: list[float] = []
for record in tqdm(df_records):
    evaluations_cosine.append(
        compute_cosine_similarity(record, embedding_model)
    )


In [14]:
import statistics


print(pd.Series(evaluations_cosine).describe())

quantiles_cosine: list[float] = statistics.quantiles(evaluations_cosine, n=4, method="inclusive")
print("Q3 Answer: The 75% cosine in the scores is {:.4f}".format(quantiles_cosine[2]))

count    300.000000
mean       0.728392
std        0.157755
min        0.125357
25%        0.651274
50%        0.763761
75%        0.836235
max        0.958796
dtype: float64
Q3 Answer: The 75% cosine in the scores is 0.8362


## Q4. Rouge

Now we will explore an alternative metric: the ROUGE score.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than cosine similarity alone.

We don't need to implement it ourselves; there's a Python package for it:

```bash
pip install rouge
```

(The latest version at the moment of writing is `1.0.1`)

Let's compute the ROUGE score between the answers at index 10 of our dataframe (`doc_id=5170565b`):

In [15]:
# Defining `r` variable
r = df.iloc[10]
r

answer_llm     Yes, all sessions are recorded, so if you miss...
answer_orig    Everything is recorded, so you won’t miss anyt...
document                                                5170565b
question                    Are sessions recorded if I miss one?
course                                 machine-learning-zoomcamp
Name: 10, dtype: object

In [16]:
from rouge import Rouge
rouge_scorer = Rouge()

scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]

There are three scores: `rouge-1`, `rouge-2` and `rouge-l`, and precision, recall and F1 score for each.

* `rouge-1` - the overlap of unigrams,
* `rouge-2` - bigrams,
* `rouge-l` - the longest common subsequence
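
To make the n-gram overlap idea concrete, here is a simplified sketch that computes precision, recall and F1 over shared n-grams. It is not the `rouge` package's implementation: the `ngrams` and `overlap_scores` helpers are hypothetical, tokenization is a plain whitespace split, and each distinct n-gram is counted once, so the numbers may differ from what the package reports.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the distinct word n-grams, using naive whitespace tokenization."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_scores(hypothesis: str, reference: str, n: int) -> dict[str, float]:
    """Precision, recall and F1 over shared n-grams (simplified, set-based)."""
    hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
    shared = len(hyp & ref)
    p = shared / len(hyp) if hyp else 0.0
    r = shared / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"r": r, "p": p, "f": f}


hypothesis = "the cat sat on the mat"
reference = "the cat lay on a mat"

# Unigrams: 4 shared words out of 5 (hypothesis) and 6 (reference) distinct words.
unigram = overlap_scores(hypothesis, reference, n=1)
# Bigrams: only ("the", "cat") appears in both sentences.
bigram = overlap_scores(hypothesis, reference, n=2)
print(unigram, bigram)
```

Bigram overlap is stricter than unigram overlap because word order has to match, which is why `rouge-2` is typically lower than `rouge-1` for the same pair of answers.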

What's the F score for `rouge-1`?

- 0.35
- 0.45
- 0.55
- 0.65

### Notes:

:white_check_mark: 0.45

In [17]:
print(scores)

print("Q4 Answer: The F score for `rouge-1` is {:.2f}".format(scores["rouge-1"]["f"]))

{'rouge-1': {'r': 0.45454545454545453, 'p': 0.45454545454545453, 'f': 0.45454544954545456}, 'rouge-2': {'r': 0.21621621621621623, 'p': 0.21621621621621623, 'f': 0.21621621121621637}, 'rouge-l': {'r': 0.3939393939393939, 'p': 0.3939393939393939, 'f': 0.393939388939394}}
Q4 Answer: The F score for `rouge-1` is 0.45


## Q5. Average rouge score

Let's compute the average of the `rouge-1`, `rouge-2` and `rouge-l` F scores for the same record from Q4.

- 0.35
- 0.45
- 0.55
- 0.65

### Notes:

:white_check_mark: 0.35

In [18]:
rouge_1 = scores['rouge-1']['f']
rouge_2 = scores['rouge-2']['f']
rouge_l = scores['rouge-l']['f']
rouge_avg = (rouge_1 + rouge_2 + rouge_l) / 3

In [19]:
print("Q5 Answer: The average between `rouge-1`, `rouge-2` "
      + "and `rouge-l` for the same record from Q4 is {:.2f}"
      .format(rouge_avg))

Q5 Answer: The average between `rouge-1`, `rouge-2` and `rouge-l` for the same record from Q4 is 0.35


## Q6. Average rouge score for all the data points

Now let's compute the score for all the records and create a dataframe from them.

What's the average `rouge_2` across all the records?

- 0.10
- 0.20
- 0.30
- 0.40

### Notes:

:white_check_mark: 0.20

In [20]:
record_rouge_scores = []
for record in tqdm(df_records):
    rouge_scores = rouge_scorer.get_scores(record['answer_llm'], record['answer_orig'])[0]
    record_rouge_scores.append(rouge_scores)

record_rouge_scores[:4]


[{'rouge-1': {'r': 0.061224489795918366,
   'p': 0.21428571428571427,
   'f': 0.09523809178130524},
  'rouge-2': {'r': 0.017543859649122806,
   'p': 0.07142857142857142,
   'f': 0.028169010918468917},
  'rouge-l': {'r': 0.061224489795918366,
   'p': 0.21428571428571427,
   'f': 0.09523809178130524}},
 {'rouge-1': {'r': 0.08163265306122448,
   'p': 0.26666666666666666,
   'f': 0.12499999641113292},
  'rouge-2': {'r': 0.03508771929824561,
   'p': 0.13333333333333333,
   'f': 0.05555555225694465},
  'rouge-l': {'r': 0.061224489795918366, 'p': 0.2, 'f': 0.09374999641113295}},
 {'rouge-1': {'r': 0.32653061224489793,
   'p': 0.5714285714285714,
   'f': 0.41558441095631643},
  'rouge-2': {'r': 0.14035087719298245,
   'p': 0.24242424242424243,
   'f': 0.17777777313333343},
  'rouge-l': {'r': 0.30612244897959184,
   'p': 0.5357142857142857,
   'f': 0.3896103849822905}},
 {'rouge-1': {'r': 0.16326530612244897, 'p': 0.32, 'f': 0.2162162117421476},
  'rouge-2': {'r': 0.03508771929824561,
   'p': 0

In [21]:
from functools import reduce
from typing import TypedDict


class RougeDict(TypedDict):
    r: float
    p: float
    f: float


Rouge2Key = TypedDict("Rouge2Key", { "rouge-2": RougeDict })


def add_rouge_2(rouge: Rouge2Key, rouge2: Rouge2Key) -> Rouge2Key:
    return {
        "rouge-2": {
            "r": rouge["rouge-2"]["r"] + rouge2["rouge-2"]["r"],
            "p": rouge["rouge-2"]["p"] + rouge2["rouge-2"]["p"],
            "f": rouge["rouge-2"]["f"] + rouge2["rouge-2"]["f"],
        }
    }
    
       
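# Sum the rouge-2 scores over all records, then divide by the record count.
sum_rouge_2: Rouge2Key = reduce(add_rouge_2, record_rouge_scores)

rouge_2_dict: RougeDict = sum_rouge_2["rouge-2"]
for key in rouge_2_dict:
    rouge_2_dict[key] = rouge_2_dict[key] / len(record_rouge_scores)

# `rouge_2_dict` aliases the inner dict, so `sum_rouge_2` now holds the averages.
print(sum_rouge_2)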
print("Q6 Answer: The average score for `rouge-2` across all the records is {:.4f}".format(rouge_2_dict["f"]))

{'rouge-2': {'r': 0.19861258009846802, 'p': 0.2586264651699855, 'f': 0.20696501983423318}}
Q6 Answer: The average score for `rouge-2` across all the records is 0.2070
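
The question suggests building a dataframe from the scores, which is an alternative to the `reduce` approach above. A minimal sketch, assuming the same nested shape that `get_scores(...)[0]` returns: the `toy_scores` values are made up, and the names `df_rouge`, `rouge_1`, `rouge_2`, `rouge_l` and `rouge_avg` are illustrative choices.

```python
import pandas as pd

# Hypothetical per-record results; the numbers are invented for illustration.
toy_scores = [
    {"rouge-1": {"r": 0.5, "p": 0.5, "f": 0.5},
     "rouge-2": {"r": 0.1, "p": 0.1, "f": 0.1},
     "rouge-l": {"r": 0.4, "p": 0.4, "f": 0.4}},
    {"rouge-1": {"r": 0.7, "p": 0.7, "f": 0.7},
     "rouge-2": {"r": 0.3, "p": 0.3, "f": 0.3},
     "rouge-l": {"r": 0.6, "p": 0.6, "f": 0.6}},
]

# Keep only the F score of each metric, one row per record.
rows = [
    {
        "rouge_1": s["rouge-1"]["f"],
        "rouge_2": s["rouge-2"]["f"],
        "rouge_l": s["rouge-l"]["f"],
    }
    for s in toy_scores
]
df_rouge = pd.DataFrame(rows)

# Per-record average of the three F scores, as in Q5.
df_rouge["rouge_avg"] = df_rouge[["rouge_1", "rouge_2", "rouge_l"]].mean(axis=1)

# Column-wise means across all records.
print(df_rouge.mean())
```

With the real data, passing `record_rouge_scores` in place of `toy_scores` should give the same `rouge_2` average as the `reduce` version, with the other metrics available in the same table.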
