In [2]:
import pandas as pd

github_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv'

url = f'{github_url}?raw=1'

df = pd.read_csv(url)

In [3]:
df = df.iloc[:300]

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   answer_llm   300 non-null    object
 1   answer_orig  300 non-null    object
 2   document     300 non-null    object
 3   question     300 non-null    object
 4   course       300 non-null    object
dtypes: object(5)
memory usage: 11.8+ KB


## Q1. Getting the embeddings model

In [7]:
from sentence_transformers import SentenceTransformer

model_name = 'multi-qa-mpnet-base-dot-v1'

embedding_model = SentenceTransformer(model_name)

In [8]:
answer_llm = df.iloc[0].answer_llm

answer_llm

'You can sign up for the course by visiting the course page at [http://mlzoomcamp.com/](http://mlzoomcamp.com/).'

### What's the first value of the resulting vector?

- [x] -0.42
- [ ] -0.22
- [ ] -0.02
- [ ] 0.21

In [11]:
v = embedding_model.encode(answer_llm)

v[0]

-0.42244655

## Q2. Computing the dot product
Now, for each answer pair, let's create embeddings and compute the dot product between them.

We will collect the resulting scores in a list (`similarity` in the code below).

### What's the 75th percentile of the score?

- [ ] 21.67
- [x] 31.67
- [ ] 41.67
- [ ] 51.67

In [14]:
def compute_similarity(record):
    answer_orig = record['answer_orig']
    answer_llm = record['answer_llm']
    
    v_llm = embedding_model.encode(answer_llm)
    v_orig = embedding_model.encode(answer_orig)
    
    return v_llm.dot(v_orig)

In [17]:
results = df.to_dict(orient='records')

results[0]


{'answer_llm': 'You can sign up for the course by visiting the course page at [http://mlzoomcamp.com/](http://mlzoomcamp.com/).',
 'answer_orig': 'Machine Learning Zoomcamp FAQ\nThe purpose of this document is to capture frequently asked technical questions.\nWe did this for our data engineering course and it worked quite well. Check this document for inspiration on how to structure your questions and answers:\nData Engineering Zoomcamp FAQ\nIn the course GitHub repository there’s a link. Here it is: https://airtable.com/shryxwLd0COOEaqXo\nwork',
 'document': '0227b872',
 'question': 'Where can I sign up for the course?',
 'course': 'machine-learning-zoomcamp'}

In [18]:
from tqdm.auto import tqdm

similarity = []

for record in tqdm(results):
    sim = compute_similarity(record)
    similarity.append(sim)

100%|████████████████████████████████| 300/300 [02:20<00:00,  2.13it/s]


In [19]:
df['dot'] = similarity
df['dot'].describe()

count    300.000000
mean      27.495996
std        6.384742
min        4.547924
25%       24.307844
50%       28.336870
75%       31.674309
max       39.476013
Name: dot, dtype: float64
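
The loop above calls `encode` once per text. A possible alternative, sketched below, is to encode each column in a single batched call and then take row-wise dot products. The helper name `rowwise_dot` is made up for this illustration, the small arrays stand in for real embeddings, and the commented lines assume that `encode` is given a list of strings and returns a 2-D array. Batched results may differ from the one-at-a-time scores by small floating-point amounts.

```python
import numpy as np

def rowwise_dot(a, b):
    """Dot product between matching rows of two (n, dim) arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return (a * b).sum(axis=1)

# Toy stand-ins for two batches of embeddings (2 texts, 3 dimensions each)
toy_llm = np.array([[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]])
toy_orig = np.array([[4.0, 5.0, 6.0], [1.0, 1.0, 1.0]])
toy_scores = rowwise_dot(toy_llm, toy_orig)  # 1*4 + 2*5 + 3*6 = 32, and 0 + 1 + 0 = 1

# With the real model (assumption: encode() accepts a list of strings):
# v_llm = embedding_model.encode(df['answer_llm'].tolist())
# v_orig = embedding_model.encode(df['answer_orig'].tolist())
# df['dot'] = rowwise_dot(v_llm, v_orig)
```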

## Q3. Computing the cosine
From Q2, we can see that the results are not within the `[0, 1]` range. This is because the vectors produced by this model are not normalized.

So we need to normalize them. To do that, we:

- Compute the norm of a vector
- Divide each element by this norm

So, for a vector `v`, the normalized vector is `v / ||v||`.

In numpy, this is how you do it:

```
import numpy as np

norm = np.sqrt((v * v).sum())
v_norm = v / norm
```
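
As a small check of the idea, the sketch below normalizes two toy vectors and confirms that the dot product of the normalized vectors matches the usual cosine formula. The vectors `a` and `b` and the helper `normalize` are illustrative only and are not part of the homework data.

```python
import numpy as np

def normalize(v):
    """Return v scaled to unit length (v / ||v||)."""
    v = np.asarray(v, dtype=float)
    norm = np.sqrt((v * v).sum())
    return v / norm

a = np.array([3.0, 4.0])  # ||a|| = 5
b = np.array([4.0, 3.0])  # ||b|| = 5

cos_from_normalized = normalize(a).dot(normalize(b))
cos_from_formula = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Both equal 24 / 25 = 0.96, up to floating-point rounding
```

Note that `normalize` divides by zero for an all-zero vector, so real code may want to guard against that case.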

### What's the 75th percentile of the cosine scores?

- [ ] 0.63
- [ ] 0.73
- [x] 0.83
- [ ] 0.93

In [29]:
import numpy as np

def cosine_similarity(record):
    answer_orig = record['answer_orig']
    answer_llm = record['answer_llm']
    
    v_llm = embedding_model.encode(answer_llm)
    v_orig = embedding_model.encode(answer_orig)

    # Calculate the dot product
    dot_product = np.dot(v_llm, v_orig)
    
    # Calculate the norm (magnitude)
    norm_v_llm = np.linalg.norm(v_llm)
    norm_v_orig = np.linalg.norm(v_orig)
    
    # Calculate cosine similarity (returned directly so that no local
    # variable shadows the function name)
    return dot_product / (norm_v_llm * norm_v_orig)

In [30]:
cosine_sims = []

for record in tqdm(results):
    cosine_sim = cosine_similarity(record)
    cosine_sims.append(cosine_sim)

100%|████████████████████████████████| 300/300 [02:19<00:00,  2.15it/s]


In [31]:
df['cosine'] = cosine_sims
df['cosine'].describe()

count    300.000000
mean       0.728393
std        0.157755
min        0.125357
25%        0.651274
50%        0.763761
75%        0.836235
max        0.958796
Name: cosine, dtype: float64

## Q4. Rouge
Now we will explore an alternative metric: the ROUGE score.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than cosine similarity alone.

We don't need to implement it ourselves; there's a Python package for it:

```pip install rouge```

(The latest version at the moment of writing is `1.0.1`)

Let's compute the ROUGE score between the answers at index 10 of our dataframe (document ID `5170565b`):

```
from rouge import Rouge
rouge_scorer = Rouge()

# r is the record at index 10 of the dataframe
r = df.iloc[10]
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
```
There are three scores: `rouge-1`, `rouge-2` and `rouge-l`, and precision, recall and F1 score for each.

- `rouge-1` - the overlap of unigrams,
- `rouge-2` - bigrams,
- `rouge-l` - the longest common subsequence
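
To make the unigram-overlap idea concrete, here is a simplified sketch of a ROUGE-1 style score. It is not the `rouge` package's implementation: the function name `rouge_1_sketch`, the lowercasing, and the whitespace tokenization are assumptions made for this illustration, so its numbers will not necessarily match the package's output.

```python
from collections import Counter

def rouge_1_sketch(hypothesis, reference):
    """Unigram-overlap precision, recall and F1 with whitespace tokenization."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter & Counter keeps the minimum count of each shared token
    overlap = sum((hyp_counts & ref_counts).values())
    p = overlap / max(sum(hyp_counts.values()), 1)
    r = overlap / max(sum(ref_counts.values()), 1)
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return {'r': r, 'p': p, 'f': f}

example = rouge_1_sketch('the cat sat on the mat', 'the cat is on the mat')
# 5 of the 6 tokens overlap on each side, so r = p = f = 5/6
```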

### What's the F score for rouge-1?

- [ ] 0.35
- [x] 0.45
- [ ] 0.55
- [ ] 0.65

In [32]:
from rouge import Rouge

rouge_scorer = Rouge()


In [46]:
df[['answer_llm', 'answer_orig']].to_dict(orient='records')[:2]

[{'answer_llm': 'You can sign up for the course by visiting the course page at [http://mlzoomcamp.com/](http://mlzoomcamp.com/).',
  'answer_orig': 'Machine Learning Zoomcamp FAQ\nThe purpose of this document is to capture frequently asked technical questions.\nWe did this for our data engineering course and it worked quite well. Check this document for inspiration on how to structure your questions and answers:\nData Engineering Zoomcamp FAQ\nIn the course GitHub repository there’s a link. Here it is: https://airtable.com/shryxwLd0COOEaqXo\nwork'},
 {'answer_llm': 'You can sign up using the link provided in the course GitHub repository: [https://airtable.com/shryxwLd0COOEaqXo](https://airtable.com/shryxwLd0COOEaqXo).',
  'answer_orig': 'Machine Learning Zoomcamp FAQ\nThe purpose of this document is to capture frequently asked technical questions.\nWe did this for our data engineering course and it worked quite well. Check this document for inspiration on how to structure your question

In [41]:
r = df.iloc[10]
r

answer_llm     Yes, all sessions are recorded, so if you miss...
answer_orig    Everything is recorded, so you won’t miss anyt...
document                                                5170565b
question                    Are sessions recorded if I miss one?
course                                 machine-learning-zoomcamp
dot                                                    32.344711
cosine                                                  0.777956
Name: 10, dtype: object

In [40]:
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]

scores

{'rouge-1': {'r': 0.45454545454545453,
  'p': 0.45454545454545453,
  'f': 0.45454544954545456},
 'rouge-2': {'r': 0.21621621621621623,
  'p': 0.21621621621621623,
  'f': 0.21621621121621637},
 'rouge-l': {'r': 0.3939393939393939,
  'p': 0.3939393939393939,
  'f': 0.393939388939394}}

## Q5. Average rouge score

Let's compute the average of the F-scores for `rouge-1`, `rouge-2` and `rouge-l` for the same record from Q4.

- [x] 0.35
- [ ] 0.45
- [ ] 0.55
- [ ] 0.65

In [42]:
f_avg = (scores['rouge-1']['f'] + scores['rouge-2']['f'] + scores['rouge-l']['f']) / 3
f_avg

0.35490034990035496

## Q6. Average rouge score for all the data points
Now let's compute the score for all the records and create a dataframe from them.

### What's the average `rouge-2` F-score across all the records?

- [ ] 0.10
- [x] 0.20
- [ ] 0.30
- [ ] 0.40

In [48]:
records = df[['answer_llm', 'answer_orig']].to_dict(orient='records')
rouge_2_scores = [
    rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]['rouge-2']['f']
    for r in records
]


In [50]:
avg_rouge_2_scores = np.mean(rouge_2_scores)

avg_rouge_2_scores

0.20696501983423318
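
Q6 also mentions creating a dataframe from the scores. One way to do that, sketched below, is to flatten each nested score dictionary into a single row. The helper `flatten_scores` and the column naming (`rouge-2_f` and so on) are choices made for this illustration, and the sample input is a rounded copy of the Q4 scores.

```python
def flatten_scores(scores):
    """Turn {'rouge-1': {'r': .., 'p': .., 'f': ..}, ...} into one flat dict."""
    return {
        f'{metric}_{part}': value
        for metric, parts in scores.items()
        for part, value in parts.items()
    }

# Rounded copy of the Q4 scores, used here only as sample input
sample = {
    'rouge-1': {'r': 0.4545, 'p': 0.4545, 'f': 0.4545},
    'rouge-2': {'r': 0.2162, 'p': 0.2162, 'f': 0.2162},
    'rouge-l': {'r': 0.3939, 'p': 0.3939, 'f': 0.3939},
}
row = flatten_scores(sample)

# With the real data (assumes `rouge_scorer`, `df` and `pd` from the cells above):
# rows = [flatten_scores(rouge_scorer.get_scores(a, b)[0])
#         for a, b in zip(df['answer_llm'], df['answer_orig'])]
# df_rouge = pd.DataFrame(rows)
# df_rouge['rouge-2_f'].mean()
```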