Homework: Evaluation and Monitoring

Getting the data

In [2]:
import pandas as pd

base_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main'
relative_url = '04-monitoring/data/results-gpt4o-mini.csv'
github_url = f'{base_url}/{relative_url}'
url = f'{github_url}?raw=1'
df = pd.read_csv(url)

In [3]:
# Keep only the first 300 records
df = df.iloc[:300]

In [4]:
df.iloc[0]

answer_llm     You can sign up for the course by visiting the...
answer_orig    Machine Learning Zoomcamp FAQ\nThe purpose of ...
document                                                0227b872
question                     Where can I sign up for the course?
course                                 machine-learning-zoomcamp
Name: 0, dtype: object

In [6]:
df.head()

Unnamed: 0,answer_llm,answer_orig,document,question,course
0,You can sign up for the course by visiting the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Where can I sign up for the course?,machine-learning-zoomcamp
1,You can sign up using the link provided in the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Can you provide a link to sign up?,machine-learning-zoomcamp
2,"Yes, there is an FAQ for the Machine Learning ...",Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Is there an FAQ for this Machine Learning course?,machine-learning-zoomcamp
3,The context does not provide any specific info...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Does this course have a GitHub repository for ...,machine-learning-zoomcamp
4,To structure your questions and answers for th...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,How can I structure my questions and answers f...,machine-learning-zoomcamp


Q1. Getting the embeddings model

Create the embeddings for the first LLM answer:

answer_llm = df.iloc[0].answer_llm

What's the first value of the resulting vector?

Answer: -0.42

In [8]:
from sentence_transformers import SentenceTransformer

model_name = 'multi-qa-mpnet-base-dot-v1'
embedding_model = SentenceTransformer(model_name)

You try to use a model that was created with version 3.0.0.dev0, however, your version is 2.7.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.

In [9]:
answer_llm = df.iloc[0].answer_llm

In [10]:
v = embedding_model.encode(answer_llm)

In [11]:
v[0]

-0.42244676

Q2. Computing the dot product

Now, for each answer pair, let's create embeddings and compute the dot product between them.

We will put the results (scores) into the evaluations list.

What's the 75th percentile of the scores?

Answer: 31.67
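
As a small illustration of the two operations involved here, the sketch below computes one dot product per pair of vectors and then takes the 75th percentile of the results. The toy arrays are made-up stand-ins for real embeddings (an assumption for illustration, not data from this notebook). np.percentile uses linear interpolation by default, which should correspond to the 75% row that describe() reports.

```python
import numpy as np

# Made-up stand-ins for two batches of embeddings: 3 answer pairs, 4 dimensions.
# In the notebook, each row would come from embedding_model.encode(...).
toy_llm = np.array([[1.0, 0.0, 2.0, 1.0],
                    [0.5, 0.5, 0.5, 0.5],
                    [2.0, 1.0, 0.0, 0.0]])
toy_orig = np.array([[1.0, 1.0, 1.0, 1.0],
                     [2.0, 0.0, 2.0, 0.0],
                     [1.0, 3.0, 0.0, 0.0]])

# Row-wise dot product: multiply element-wise, then sum along each row
toy_scores = (toy_llm * toy_orig).sum(axis=1)

# 75th percentile of the scores (linear interpolation between sorted values)
toy_p75 = np.percentile(toy_scores, 75)

print(toy_scores)
print(toy_p75)
```

The same percentile can also be read off a pandas Series with quantile(0.75).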


In [12]:
from tqdm.auto import tqdm

In [13]:
results_gpt4o_mini = df.to_dict(orient='records')

In [14]:
record = results_gpt4o_mini[0]

In [15]:
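def compute_similarity(record):
    """Return the dot product between the embeddings of the LLM and original answers."""
    answer_orig = record['answer_orig']
    answer_llm = record['answer_llm']

    # Embed both answers with the same model
    v_llm = embedding_model.encode(answer_llm)
    v_orig = embedding_model.encode(answer_orig)

    return v_llm.dot(v_orig)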

In [17]:
evaluations = []

for record in tqdm(results_gpt4o_mini):
    sim = compute_similarity(record)
    evaluations.append(sim)

100%|██████████| 300/300 [06:53<00:00,  1.38s/it]


In [19]:
# Note: despite the column name, these are raw dot products, not cosine similarities
df['cosine'] = evaluations
df['cosine'].describe()

count    300.000000
mean      27.495996
std        6.384742
min        4.547923
25%       24.307845
50%       28.336875
75%       31.674310
max       39.476013
Name: cosine, dtype: float64

Q3. Computing the cosine

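The dot products from Q2 are not confined to the [-1, 1] range because the vectors are not normalized. The dot product of two unit-length vectors equals their cosine similarity, so normalize each vector before computing the dot product:
- Compute the norm of the vector
- Divide each element by this norm
- So, for a vector v, the normalized vector is v / ||v||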
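
The steps above can be checked by hand on a tiny example. This is a minimal sketch with made-up 2-D vectors; the helper name normalize is hypothetical and only used here for illustration.

```python
import numpy as np

def normalize(v):
    """Scale v to unit length: v / ||v|| (hypothetical helper)."""
    return v / np.linalg.norm(v)

# Made-up vectors, chosen so the arithmetic is easy to verify
u = np.array([3.0, 4.0])   # ||u|| = sqrt(9 + 16) = 5
w = np.array([4.0, 3.0])   # ||w|| = sqrt(16 + 9) = 5

u_unit = normalize(u)      # [0.6, 0.8]
w_unit = normalize(w)      # [0.8, 0.6]

raw_dot = u.dot(w)             # 3*4 + 4*3 = 24, depends on vector lengths
cosine = u_unit.dot(w_unit)    # 0.6*0.8 + 0.8*0.6 = 0.96, always within [-1, 1]

print(raw_dot, cosine)
```

Normalizing first means the score reflects only the angle between the vectors, not their magnitudes.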

What's the 75th percentile of the cosine scores?

Answer: 0.83

In [20]:
import numpy as np

In [22]:
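def compute_similarity_norm(record):
    """Return the cosine similarity between the LLM and original answer embeddings."""
    answer_orig = record['answer_orig']
    answer_llm = record['answer_llm']

    v_llm = embedding_model.encode(answer_llm)
    v_orig = embedding_model.encode(answer_orig)

    # Divide each vector by its Euclidean norm to get unit-length vectors
    llm_norm = np.sqrt((v_llm * v_llm).sum())
    v_llm_norm = v_llm / llm_norm

    orig_norm = np.sqrt((v_orig * v_orig).sum())
    v_orig_norm = v_orig / orig_norm

    # The dot product of unit vectors is the cosine similarity
    return v_llm_norm.dot(v_orig_norm)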

In [23]:
evaluations_norm = []

for record in tqdm(results_gpt4o_mini):
    sim = compute_similarity_norm(record)
    evaluations_norm.append(sim)

100%|██████████| 300/300 [07:35<00:00,  1.52s/it]


In [24]:
df['cosine_2'] = evaluations_norm
df['cosine_2'].describe()

count    300.000000
mean       0.728393
std        0.157755
min        0.125357
25%        0.651273
50%        0.763761
75%        0.836235
max        0.958796
Name: cosine_2, dtype: float64

Q4. Rouge

We use the rouge Python package (the latest version at the time of writing is 1.0.1).

Let's compute the ROUGE scores between the two answers at index 10 of our dataframe (doc_id=5170565b).

There are three scores: rouge-1, rouge-2 and rouge-l, each reported with precision, recall and F1.

- rouge-1: the overlap of unigrams
- rouge-2: the overlap of bigrams
- rouge-l: the longest common subsequence
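
To make the unigram-overlap idea concrete, here is a simplified sketch of ROUGE-1-style precision, recall and F1 on whitespace tokens. The function name unigram_overlap_scores and the example sentences are made up for illustration. This is not the rouge package's implementation, whose tokenization and counting details may differ, so the numbers need not match its output.

```python
from collections import Counter

def unigram_overlap_scores(hypothesis, reference):
    """Simplified ROUGE-1-style precision, recall and F1 (hypothetical helper)."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((hyp_counts & ref_counts).values())

    hyp_total = sum(hyp_counts.values())
    ref_total = sum(ref_counts.values())

    # Precision: share of hypothesis tokens found in the reference
    precision = overlap / hyp_total if hyp_total else 0.0
    # Recall: share of reference tokens found in the hypothesis
    recall = overlap / ref_total if ref_total else 0.0

    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {'p': precision, 'r': recall, 'f': f1}

# Five of the six tokens on each side overlap, so p = r = f = 5/6
example = unigram_overlap_scores('the cat sat on the mat',
                                 'the cat lay on the mat')
print(example)
```

rouge-2 follows the same pattern with pairs of adjacent tokens in place of single tokens.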

What's the F1 score for rouge-1?

Answer: 0.45

In [28]:
from rouge import Rouge
rouge_scorer = Rouge()

a = df.iloc[10]['answer_llm']
b = df.iloc[10]['answer_orig']
scores = rouge_scorer.get_scores(a,b)[0]
df_scores = pd.DataFrame(scores)

df_scores

    rouge-1   rouge-2   rouge-l
r  0.454545  0.216216  0.393939
p  0.454545  0.216216  0.393939
f  0.454545  0.216216  0.393939


Q5. Average rouge score

Let's compute the average of the rouge-1, rouge-2 and rouge-l F1 scores for the same record from Q4.

Answer: 0.35

In [29]:
rouge_1 = scores['rouge-1']['f']
rouge_2 = scores['rouge-2']['f']
rouge_l = scores['rouge-l']['f']
rouge_avg = (rouge_1 + rouge_2 + rouge_l) / 3

In [30]:
rouge_avg

0.35490034990035496

Q6. Average rouge score for all the data points

Now let's compute the score for all the records.

What's the average rouge-2 F1 score across all the records?

Answer: 0.20 (computed mean is approximately 0.207)

In [34]:
import statistics
from rouge import Rouge

rouge_scorer = Rouge()

# Collect the rouge-2 F1 score for every (LLM answer, original answer) pair
rouge_2_scores = []

for i in range(len(df)):
    llm_text = df.iloc[i]['answer_llm']
    orig_text = df.iloc[i]['answer_orig']

    pair_scores = rouge_scorer.get_scores(llm_text, orig_text)[0]
    rouge_2_scores.append(pair_scores['rouge-2']['f'])

print(pd.DataFrame(rouge_2_scores))
avg_rouge_2 = statistics.mean(rouge_2_scores)
print(avg_rouge_2)


            0
0    0.028169
1    0.055556
2    0.177778
3    0.047059
4    0.033898
..        ...
295  0.540984
296  0.460432
297  0.564516
298  0.132231
299  0.023529

[300 rows x 1 columns]
0.20696501983423318
