In [14]:
import pandas as pd
import numpy as np
from rouge import Rouge

from sentence_transformers import SentenceTransformer


In [2]:
github_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv'

In [3]:
url = f'{github_url}?raw=1'
df = pd.read_csv(url)

In [4]:
df = df.iloc[:300]

## Q1. Getting the embeddings model

Now, get the embedding model `multi-qa-mpnet-base-dot-v1` from
[the Sentence Transformers library](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview).

> Note: this is not the same model as in HW3

```python
from sentence_transformers import SentenceTransformer

model_name = 'multi-qa-mpnet-base-dot-v1'
embedding_model = SentenceTransformer(model_name)
```

Create the embeddings for the first LLM answer:

```python
answer_llm = df.iloc[0].answer_llm
```

What's the first value of the resulting vector?

* **-0.42**
* -0.22
* -0.02
* 0.21

In [5]:
model_name = 'multi-qa-mpnet-base-dot-v1'

embedding_model = SentenceTransformer(model_name)

In [6]:
answer_llm = df.iloc[0].answer_llm
embeddings = embedding_model.encode(answer_llm)

In [7]:
first_value = embeddings[0]
print(f"The first value of the resulting vector is: {first_value}")

The first value of the resulting vector is: -0.4224465489387512


## Q2. Computing the dot product


Now, for each answer pair, let's create the embeddings and compute the dot product between them.

We will put the results (scores) into the `evaluations` list.

What's the 75th percentile of the score?

* 21.67
* **31.67**
* 41.67
* 51.67

In [8]:
def compute_dot_product(emb1, emb2):
    return np.dot(emb1, emb2)

In [10]:
df.head(5)

| | answer_llm | answer_orig | document | question | course |
|---|---|---|---|---|---|
| 0 | You can sign up for the course by visiting the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Where can I sign up for the course? | machine-learning-zoomcamp |
| 1 | You can sign up using the link provided in the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Can you provide a link to sign up? | machine-learning-zoomcamp |
| 2 | Yes, there is an FAQ for the Machine Learning ... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Is there an FAQ for this Machine Learning course? | machine-learning-zoomcamp |
| 3 | The context does not provide any specific info... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Does this course have a GitHub repository for ... | machine-learning-zoomcamp |
| 4 | To structure your questions and answers for th... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | How can I structure my questions and answers f... | machine-learning-zoomcamp |


In [11]:
evaluations = []

for _, row in df.iterrows():
    emb_llm = embedding_model.encode(row.answer_llm)
    emb_orig = embedding_model.encode(row.answer_orig)  
    score = compute_dot_product(emb_llm, emb_orig)
    evaluations.append(score)

percentile_75 = np.percentile(evaluations, 75)
print(f"The 75th percentile of the scores is: {percentile_75}")

The 75th percentile of the scores is: 31.67430877685547
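
Encoding one answer at a time works, but `SentenceTransformer.encode` also accepts a list of texts and returns one embedding row per text, so the dot products can be computed row-wise in one step. The sketch below shows only that row-wise step on small made-up matrices; the arrays `emb_llm` and `emb_orig` and the helper `rowwise_dot` are illustrative stand-ins, not the real embeddings or part of the homework.

```python
import numpy as np

def rowwise_dot(a, b):
    """Dot product between matching rows of two (n, d) arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Multiply element-wise, then sum across the embedding dimension
    return (a * b).sum(axis=1)

# Made-up 2-dimensional "embeddings" for three answer pairs (assumption)
emb_llm = np.array([[1.0, 2.0], [0.0, 3.0], [2.0, 2.0]])
emb_orig = np.array([[3.0, 1.0], [4.0, 1.0], [1.0, 0.5]])

scores = rowwise_dot(emb_llm, emb_orig)
percentile_75 = np.percentile(scores, 75)
print(scores, percentile_75)
```

With real data, the two matrices would presumably come from calling `embedding_model.encode` once on each column converted to a list.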


## Q3. Computing the cosine

From Q2, we can see that the scores are far outside the [0, 1] range. That's because the vectors coming from this model are not normalized, so the dot product grows with their lengths. (Strictly, cosine similarity lies in [-1, 1].)

So we need to normalize them.

To normalize a vector, we:

* Compute the norm of the vector
* Divide each element by this norm

So, for vector `v`, it'll be `v / ||v||`

In numpy, this is how you do it:

```python
norm = np.sqrt((v * v).sum())
v_norm = v / norm
```

Let's put it into a function and then compute the dot product
between the normalized vectors. This will give us the cosine similarity.
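
As a quick sanity check of the idea, here is a small worked example on hand-picked vectors (the vectors are made up for illustration). It shows that the raw dot product depends on vector length, while the cosine depends only on direction:

```python
import numpy as np

def normalize_vector(v):
    # Divide each element by the Euclidean norm of the vector
    norm = np.sqrt((v * v).sum())
    return v / norm

def compute_cosine_similarity(a, b):
    # Dot product of the two unit-length vectors
    return np.dot(normalize_vector(a), normalize_vector(b))

v = np.array([3.0, 4.0])
w = np.array([6.0, 8.0])    # same direction as v, twice as long
u = np.array([-3.0, -4.0])  # opposite direction to v

raw_dot = np.dot(v, w)                        # depends on the lengths
cos_same = compute_cosine_similarity(v, w)    # same direction
cos_opposite = compute_cosine_similarity(v, u)  # opposite direction
print(raw_dot, cos_same, cos_opposite)
```

Vectors pointing the same way score 1 regardless of length, and vectors pointing in opposite directions score -1.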

What's the 75th percentile of the cosine similarity scores?

* 0.63
* 0.73
* **0.83**
* 0.93

In [25]:
def normalize_vector(v):
    norm = np.sqrt((v * v).sum())
    return v / norm

def compute_cosine_similarity(emb1, emb2):
    emb1_norm = normalize_vector(emb1)
    emb2_norm = normalize_vector(emb2)
    return np.dot(emb1_norm, emb2_norm)

In [26]:
evaluations = []

for _, row in df.iterrows():
    emb_llm = embedding_model.encode(row.answer_llm)
    emb_orig = embedding_model.encode(row.answer_orig)
    score = compute_cosine_similarity(emb_llm, emb_orig)
    evaluations.append(score)

percentile_75 = np.percentile(evaluations, 75)
print(f"The 75th percentile of the cosine similarity scores is: {percentile_75}")

The 75th percentile of the cosine similarity scores is: 0.8362348973751068


## Q4. Rouge

Now we will explore an alternative metric: the ROUGE score.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than cosine similarity alone.

We don't need to implement it ourselves; there's a Python package for it:

```bash
pip install rouge
```

(The latest version at the moment of writing is `1.0.1`)

Let's compute the ROUGE score between the answers at index 10 of our dataframe (document id `5170565b`):

```python
from rouge import Rouge
rouge_scorer = Rouge()

r = df.iloc[10]
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
```

There are three scores: `rouge-1`, `rouge-2` and `rouge-l`, and precision, recall and F1 score for each.

* `rouge-1` - the overlap of unigrams,
* `rouge-2` - bigrams,
* `rouge-l` - the longest common subsequence
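
To make the unigram overlap concrete, here is a simplified sketch of a ROUGE-1 style computation. It assumes lowercase whitespace tokenization and clipped word counts; the `rouge` package may tokenize and count differently, so its numbers need not match this sketch. The function name `rouge_1_sketch` and the example sentences are made up for illustration.

```python
from collections import Counter

def rouge_1_sketch(hypothesis, reference):
    """Simplified unigram precision, recall and F1 (not the package's code)."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return {'p': 0.0, 'r': 0.0, 'f': 0.0}
    p = overlap / sum(hyp_counts.values())
    r = overlap / sum(ref_counts.values())
    f = 2 * p * r / (p + r)
    return {'p': p, 'r': r, 'f': f}

example = rouge_1_sketch("the cat lay on the mat", "the cat sat on the mat")
print(example)
```

Here five of the six words in each sentence match, so precision, recall and F1 all come out to 5/6.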

What's the F score for `rouge-1`?

- 0.35
- **0.45**
- 0.55
- 0.65

In [17]:
# Initialize the ROUGE scorer
rouge_scorer = Rouge()

In [18]:
# Get the row at index 10
r = df.iloc[10]

print(r)

answer_llm     Yes, all sessions are recorded, so if you miss...
answer_orig    Everything is recorded, so you won’t miss anyt...
document                                                5170565b
question                    Are sessions recorded if I miss one?
course                                 machine-learning-zoomcamp
Name: 10, dtype: object


In [19]:
# Compute ROUGE scores
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]

print(scores)

{'rouge-1': {'r': 0.45454545454545453, 'p': 0.45454545454545453, 'f': 0.45454544954545456}, 'rouge-2': {'r': 0.21621621621621623, 'p': 0.21621621621621623, 'f': 0.21621621121621637}, 'rouge-l': {'r': 0.3939393939393939, 'p': 0.3939393939393939, 'f': 0.393939388939394}}


In [28]:
# Extract the F1 score for ROUGE-1
rouge_1_f1 = scores['rouge-1']['f']
print(f"rouge-1 F1: {rouge_1_f1}")

rouge-1 F1: 0.45454544954545456


## Q5. Average rouge score

Let's compute the average of the F1 scores for `rouge-1`, `rouge-2` and `rouge-l` for the same record from Q4.

- **0.35**
- 0.45
- 0.55
- 0.65

In [22]:
# Extract the F1 scores for ROUGE-2 and ROUGE-L
rouge_2_f1 = scores['rouge-2']['f']
rouge_l_f1 = scores['rouge-l']['f']

In [23]:
average_rouge = (rouge_1_f1 + rouge_2_f1 + rouge_l_f1) / 3

In [29]:
print(f"rouge-2 F1: {rouge_2_f1}")
print(f"rouge-l F1: {rouge_l_f1}")
print(f"Average rouge: {average_rouge}")

rouge-2 F1: 0.21621621121621637
rouge-l F1: 0.393939388939394
Average rouge: 0.35490034990035496


## Q6. Average rouge score for all the data points

Now let's compute the scores for all the records:

```python
rouge_1 = scores['rouge-1']['f']
rouge_2 = scores['rouge-2']['f']
rouge_l = scores['rouge-l']['f']
rouge_avg = (rouge_1 + rouge_2 + rouge_l) / 3
```

And create a dataframe from them
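
One possible way to build that dataframe is sketched below. The `all_scores` list uses made-up values shaped like the output of `get_scores(...)[0]`, and the helper `flatten_scores` is a hypothetical name; with real data, the list would be filled inside the loop over the records.

```python
import pandas as pd

# Made-up results for two records, shaped like get_scores(...)[0] (assumption)
all_scores = [
    {'rouge-1': {'r': 0.5, 'p': 0.5, 'f': 0.5},
     'rouge-2': {'r': 0.2, 'p': 0.2, 'f': 0.2},
     'rouge-l': {'r': 0.4, 'p': 0.4, 'f': 0.4}},
    {'rouge-1': {'r': 0.3, 'p': 0.3, 'f': 0.3},
     'rouge-2': {'r': 0.1, 'p': 0.1, 'f': 0.1},
     'rouge-l': {'r': 0.2, 'p': 0.2, 'f': 0.2}},
]

def flatten_scores(scores):
    """Keep only the F1 values and add their average."""
    rouge_1 = scores['rouge-1']['f']
    rouge_2 = scores['rouge-2']['f']
    rouge_l = scores['rouge-l']['f']
    return {
        'rouge_1': rouge_1,
        'rouge_2': rouge_2,
        'rouge_l': rouge_l,
        'rouge_avg': (rouge_1 + rouge_2 + rouge_l) / 3,
    }

df_rouge = pd.DataFrame([flatten_scores(s) for s in all_scores])
mean_rouge_l = df_rouge['rouge_l'].mean()
print(df_rouge)
print(mean_rouge_l)
```

Keeping all four columns in one dataframe makes it easy to compare the metrics with `df_rouge.mean()` or `df_rouge.describe()`.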

What's the average `rouge_l` across all the records?

- 0.10
- 0.20
- 0.30
- **0.40**

In [30]:
rouge_scorer = Rouge()
rouge_l_scores = []

for _, row in df.iterrows():
    scores = rouge_scorer.get_scores(row['answer_llm'], row['answer_orig'])[0]
    rouge_l = scores['rouge-l']['f']
    rouge_l_scores.append(rouge_l)

average_rouge_l = np.mean(rouge_l_scores)

In [31]:
# Round to one decimal place
rounded_average_rouge_l = round(average_rouge_l, 1)

print(f"The average ROUGE-L score across all records is: {average_rouge_l}")
print(f"Rounded to one decimal place: {rounded_average_rouge_l}")

The average ROUGE-L score across all records is: 0.3538074656078652
Rounded to one decimal place: 0.4
