## Getting the data

Let's start by getting the dataset. We will use the data we generated in the module.

In particular, we'll evaluate the quality of our RAG system
with [gpt-4o-mini](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv).

In [7]:
import numpy as np
import pandas as pd
from tqdm.auto import tqdm

# ?raw=1 asks GitHub for the raw CSV instead of the HTML page
url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv?raw=1'
df = pd.read_csv(url)


In [55]:
df = df.iloc[:300]
df.sample(4)

| | answer_llm | answer_orig | document | question | course |
|---|---|---|---|---|---|
| 105 | When posting about what you learned from the c... | When you post about what you learned from the ... | f7bc2f65 | What tag should I use when posting about my co... | machine-learning-zoomcamp |
| 154 | To read a file with Pandas in Windows, you sho... | How do I read the dataset with Pandas in Windo... | be760b92 | Can you show an example of reading a file with... | machine-learning-zoomcamp |
| 45 | Yes, you could still receive a certificate eve... | Yes, it's possible. See the previous answer. | 1d644223 | Will I receive a certificate if I don't comple... | machine-learning-zoomcamp |
| 215 | The mathematical formula for linear regression... | In Question 7 we are asked to calculate\nThe i... | 183a1c90 | What is the mathematical formula for linear re... | machine-learning-zoomcamp |


## Q1. Getting the embeddings model

Now, get the embeddings model `multi-qa-mpnet-base-dot-v1` from
[the Sentence Transformers library](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview).

> Note: this is not the same model as in HW3

```python
from sentence_transformers import SentenceTransformer

model_name = 'multi-qa-mpnet-base-dot-v1'
embedding_model = SentenceTransformer(model_name)
```

Create the embeddings for the first LLM answer:

```python
answer_llm = df.iloc[0].answer_llm
```

What's the first value of the resulting vector?

* -0.42
* -0.22
* -0.02
* 0.21


In [11]:
df.iloc[0]

answer_llm     You can sign up for the course by visiting the...
answer_orig    Machine Learning Zoomcamp FAQ\nThe purpose of ...
document                                                0227b872
question                     Where can I sign up for the course?
course                                 machine-learning-zoomcamp
Name: 0, dtype: object

In [12]:
df.iloc[0].answer_llm

'You can sign up for the course by visiting the course page at [http://mlzoomcamp.com/](http://mlzoomcamp.com/).'

In [15]:
answer_llm = df.iloc[0].answer_llm

In [44]:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")

You try to use a model that was created with version 3.0.0.dev0, however, your version is 2.7.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.





In [52]:
resulting_vector = embedding_model.encode(answer_llm)
# Keep the full vector and store its first component under a separate name
first_value = round(float(resulting_vector[0]), 2)
first_value

-0.42

## Q1 Ans: The first value of the resulting vector is -0.42


## Q2. Computing the dot product


Now, for each answer pair, let's create embeddings and compute the dot product between them.

We will put the results (scores) into the `evaluations` list.

What's the 75th percentile of the score?

* 21.67
* 31.67
* 41.67
* 51.67


In [63]:
def similarity(rec, model):
    orig_ans = rec['answer_orig']
    llm_ans = rec['answer_llm']

    vec_orig = model.encode(orig_ans)
    vec_llm = model.encode(llm_ans)

    return vec_orig.dot(vec_llm)




evaluations = []

for index, rec in tqdm(df.iterrows(), total=df.shape[0]):
    product = similarity(rec, embedding_model)
    evaluations.append(product)



evaluations_series = pd.Series(evaluations)
summary = evaluations_series.describe()
print(summary)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [02:24<00:00,  2.07it/s]

count    300.000000
mean      27.495996
std        6.384742
min        4.547924
25%       24.307844
50%       28.336870
75%       31.674309
max       39.476013
dtype: float64
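
The `75%` row printed by `describe()` is the 0.75 quantile of the series, so it can also be read off directly. Here is a minimal sketch on made-up scores (the values and the `toy_scores` name are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy scores, made up for illustration (not taken from the dataset)
toy_scores = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# describe() reports the 25%, 50% and 75% quantiles by default
q75_describe = toy_scores.describe()['75%']

# The same number via quantile() and via NumPy
# (both use linear interpolation by default)
q75_quantile = toy_scores.quantile(0.75)
q75_numpy = np.percentile(toy_scores.to_numpy(), 75)

print(q75_describe, q75_quantile, q75_numpy)
```

For the real data, `evaluations_series.quantile(0.75)` should give the same value as the `75%` row above.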





## Q2 Ans: The 75th percentile of the score is 31.67


## Q3. Computing the cosine

From Q2, we can see that the scores fall well outside the [-1, 1] range that cosine similarity is limited to. That's because the vectors coming from this model are not normalized.

So we need to normalize them.

To do it, we:

* Compute the norm of a vector
* Divide each element by this norm

So, for vector `v`, it'll be `v / ||v||`

In numpy, this is how you do it:

```python
norm = np.sqrt((v * v).sum())
v_norm = v / norm
```

Let's put it into a function and then compute the dot product
between the normalized vectors. This will give us the cosine similarity.
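
As a quick sanity check of the idea, here is a small sketch with two made-up 2D vectors. It shows that the dot product of the normalized vectors matches the textbook cosine formula, while the raw dot product depends on the vector lengths. The vectors `a` and `b` are illustrative only.

```python
import numpy as np

def normalize(v):
    """Scale v to unit length (L2 norm equal to 1)."""
    norm = np.sqrt((v * v).sum())
    return v / norm

# Toy vectors, made up for illustration; both have length 5
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Raw dot product: 3*4 + 4*3 = 24, which grows with the vector lengths
raw_dot = a.dot(b)

# Dot product of unit vectors: 24 / (5 * 5) = 0.96, depends only on the angle
cosine = normalize(a).dot(normalize(b))

# Same value from the formula a.b / (||a|| * ||b||)
cosine_formula = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(raw_dot, cosine, cosine_formula)
```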

What's the 75th percentile of the cosine similarity scores?

* 0.63
* 0.73
* 0.83
* 0.93

In [65]:
def normalize(v):
    # Scale the vector to unit length (L2 norm of 1)
    norm = np.sqrt((v * v).sum())
    v_norm = v / norm
    return v_norm

def normalized_similarity(rec, model):
    orig_ans = rec['answer_orig']
    llm_ans = rec['answer_llm']

    vec_orig = model.encode(orig_ans)
    vec_llm = model.encode(llm_ans)

    vec_orig = normalize(vec_orig)
    vec_llm = normalize(vec_llm)

    # The dot product of unit vectors is the cosine similarity
    return vec_orig.dot(vec_llm)


normalized_evaluations = []

for index, rec in tqdm(df.iterrows(), total=df.shape[0]):
    product = normalized_similarity(rec, embedding_model)
    normalized_evaluations.append(product)


normalized_evaluations_series = pd.Series(normalized_evaluations)
summary = normalized_evaluations_series.describe()
print(summary)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [02:25<00:00,  2.06it/s]

count    300.000000
mean       0.728393
std        0.157755
min        0.125357
25%        0.651273
50%        0.763761
75%        0.836235
max        0.958796
dtype: float64
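
The loops in Q2 and Q3 encode every answer twice, once per question, which is where most of the runtime goes. A possible alternative is to compute the embeddings once and derive both scores from the same arrays. The sketch below uses small made-up matrices (`emb_orig`, `emb_llm`) in place of real embeddings. In the notebook they would presumably come from calling `embedding_model.encode` on the lists of original and LLM answers, which is not run here.

```python
import numpy as np

# Stand-ins for two embedding matrices, one row per record (made-up values)
emb_orig = np.array([[1.0, 2.0, 2.0],
                     [0.0, 3.0, 4.0]])
emb_llm = np.array([[2.0, 1.0, 2.0],
                    [0.0, 4.0, 3.0]])

# Row-wise dot product: one score per record, as in Q2 (here 8 and 24)
dot_scores = (emb_orig * emb_llm).sum(axis=1)

# Row-wise cosine: divide by the product of the row norms, as in Q3
# (here 8 / (3 * 3) and 24 / (5 * 5))
norms = np.linalg.norm(emb_orig, axis=1) * np.linalg.norm(emb_llm, axis=1)
cosine_scores = dot_scores / norms

print(dot_scores)
print(cosine_scores)
```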





## Q3 Ans: The 75th percentile of the cosine similarity is 0.83

## Q4. Rouge

Now we will explore an alternative metric: the ROUGE score.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than cosine similarity alone.

We don't need to implement it ourselves; there's a Python package for it:

```bash
pip install rouge
```

(The latest version at the moment of writing is `1.0.1`)

Let's compute the ROUGE score between the answers at index 10 of our dataframe (`doc_id=5170565b`):

```python
from rouge import Rouge
rouge_scorer = Rouge()

r = df.iloc[10]
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
```

There are three scores: `rouge-1`, `rouge-2` and `rouge-l`, and precision, recall and F1 score for each.

* `rouge-1` - the overlap of unigrams,
* `rouge-2` - bigrams,
* `rouge-l` - the longest common subsequence
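
To make the unigram case concrete, here is a simplified sketch of how precision, recall and F1 can be derived from token overlap. It is an illustration only: it splits on whitespace and counts repeated tokens, and the `rouge` package may tokenize and count differently, so its numbers need not match this function. The name `rouge_1_sketch` and the example sentences are made up.

```python
from collections import Counter

def rouge_1_sketch(hypothesis, reference):
    """Unigram-overlap precision, recall and F1 on whitespace tokens."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Counter & Counter keeps each shared token with the smaller of its two counts
    overlap = sum((hyp_counts & ref_counts).values())

    # Precision: share of hypothesis tokens that also appear in the reference
    precision = overlap / max(sum(hyp_counts.values()), 1)
    # Recall: share of reference tokens that also appear in the hypothesis
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {'p': precision, 'r': recall, 'f': f1}

# 5 of the 6 tokens overlap on each side, so p = r = f = 5/6
example = rouge_1_sketch('the cat sat on the mat', 'the cat is on the mat')
print(example)
```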

What's the F score for `rouge-1`?

- 0.35
- 0.45
- 0.55
- 0.65

In [66]:
from rouge import Rouge
rouge_scorer = Rouge()

In [69]:
scores = rouge_scorer.get_scores(rec['answer_llm'], rec['answer_orig'])[0]
scores

{'rouge-1': {'r': 0.125, 'p': 0.3181818181818182, 'f': 0.17948717543721246},
 'rouge-2': {'r': 0.01694915254237288,
  'p': 0.038461538461538464,
  'f': 0.023529407518339866},
 'rouge-l': {'r': 0.10714285714285714,
  'p': 0.2727272727272727,
  'f': 0.1538461497961868}}

In [84]:
r = df.iloc[10]  # the record with document '5170565b'
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
scores['rouge-1']['f']

0.45454544954545456

## Q4 Ans: The F score for rouge-1 is 0.45

## Q5. Average rouge score

Let's compute the average of the `rouge-1`, `rouge-2` and `rouge-l` F scores for the same record from Q4.

- 0.35
- 0.45
- 0.55
- 0.65

In [87]:
scores

{'rouge-1': {'r': 0.45454545454545453,
  'p': 0.45454545454545453,
  'f': 0.45454544954545456},
 'rouge-2': {'r': 0.21621621621621623,
  'p': 0.21621621621621623,
  'f': 0.21621621121621637},
 'rouge-l': {'r': 0.3939393939393939,
  'p': 0.3939393939393939,
  'f': 0.393939388939394}}

In [93]:
r = df.iloc[10]  # same record as in Q4
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
rouge_1 = scores['rouge-1']['f']
rouge_2 = scores['rouge-2']['f']
rouge_l = scores['rouge-l']['f']

average = np.mean([rouge_1, rouge_2, rouge_l])
average = round(float(average), 2)
average

0.35

## Q5 Ans: The average of rouge-1, rouge-2 and rouge-l is 0.35

## Q6. Average rouge score for all the data points

Now let's compute the score for all the records:

```python
rouge_1 = scores['rouge-1']['f']
rouge_2 = scores['rouge-2']['f']
rouge_l = scores['rouge-l']['f']
rouge_avg = (rouge_1 + rouge_2 + rouge_l) / 3
```

And create a dataframe from them.
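
As a rough illustration of that last step, the sketch below builds a dataframe from a list of per-record score dictionaries and averages each column. The records in `toy_records` are made up and only stand in for real ROUGE output.

```python
import pandas as pd

# Made-up per-record F scores, standing in for real ROUGE results
toy_records = [
    {'rouge_1': 0.50, 'rouge_2': 0.20, 'rouge_l': 0.40},
    {'rouge_1': 0.30, 'rouge_2': 0.10, 'rouge_l': 0.20},
]

# One row per record, one column per metric
toy_df = pd.DataFrame(toy_records)

# Per-record average of the three F scores
toy_df['rouge_avg'] = toy_df[['rouge_1', 'rouge_2', 'rouge_l']].mean(axis=1)

# Column means across all records
column_means = toy_df.mean()
print(column_means)
```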

What's the average `rouge_2` across all the records?

- 0.10
- 0.20
- 0.30
- 0.40

In [95]:
def calculate_rouge_scores(row):
    # Hypothesis (LLM answer) first, reference (original answer) second, as in Q4
    scores = rouge_scorer.get_scores(row['answer_llm'], row['answer_orig'])[0]
    return pd.Series({
        'rouge-1': scores['rouge-1']['f'],
        'rouge-2': scores['rouge-2']['f'],
        'rouge-l': scores['rouge-l']['f'],
        'rouge-avg': (scores['rouge-1']['f'] + scores['rouge-2']['f'] + scores['rouge-l']['f']) / 3
    })

tqdm.pandas()
scores_df = df.progress_apply(calculate_rouge_scores, axis=1)

average_rouge_2 = scores_df['rouge-2'].mean()
print(f"The average ROUGE-2 score across all records is: {average_rouge_2}")

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [00:01<00:00, 284.41it/s]

The average ROUGE-2 score across all records is: 0.20696501983423318





## Q6 Ans: The average ROUGE-2 score across all records is 0.20