## Homework: Evaluation and Monitoring
In this homework, we'll evaluate the quality of our RAG system.

In [20]:
import pandas as pd
import numpy as np

## Getting the data
Let's start by getting the dataset. We will use the data we generated in the module.

In particular, we'll evaluate the quality of our RAG system with gpt-4o-mini.

In [3]:
github_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv'
url = f'{github_url}?raw=1'
df = pd.read_csv(url)

In [4]:
df.head()

| Unnamed: 0 | answer_llm | answer_orig | document | question | course |
|---|---|---|---|---|---|
| 0 | You can sign up for the course by visiting the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Where can I sign up for the course? | machine-learning-zoomcamp |
| 1 | You can sign up using the link provided in the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Can you provide a link to sign up? | machine-learning-zoomcamp |
| 2 | Yes, there is an FAQ for the Machine Learning ... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Is there an FAQ for this Machine Learning course? | machine-learning-zoomcamp |
| 3 | The context does not provide any specific info... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Does this course have a GitHub repository for ... | machine-learning-zoomcamp |
| 4 | To structure your questions and answers for th... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | How can I structure my questions and answers f... | machine-learning-zoomcamp |


In [5]:
df = df.iloc[:300]

## Q1. Getting the embeddings model
Now get the embedding model multi-qa-mpnet-base-dot-v1 from the Sentence Transformers library:
https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview

In [6]:
from sentence_transformers import SentenceTransformer



In [7]:
model_name = 'multi-qa-mpnet-base-dot-v1'
embedding_model = SentenceTransformer(model_name)

Create the embeddings for the first LLM answer:

In [8]:
answer_llm = df.iloc[0].answer_llm

In [9]:
answer_llm

'You can sign up for the course by visiting the course page at [http://mlzoomcamp.com/](http://mlzoomcamp.com/).'

In [11]:
embedding = embedding_model.encode(answer_llm)

In [12]:
embedding[0]

-0.42244655

## Q2. Computing the dot product
Now, for each answer pair, let's create embeddings and compute the dot product between them.

We will put the results (scores) into the evaluations list.

What's the 75th percentile of the score?

In [15]:
answer_orig = df['answer_orig'][0]
answer_llm = df['answer_llm'][0]

v_orig = embedding_model.encode(answer_orig)
v_llm = embedding_model.encode(answer_llm)

v_orig.dot(v_llm)

17.515987

In [16]:
df['orig_vector'] = df['answer_orig'].apply(lambda x: embedding_model.encode(x))

In [18]:
df['llm_vector'] = df['answer_llm'].apply(lambda x: embedding_model.encode(x))

In [19]:
df['dot_product'] = df.apply(lambda row: row['orig_vector'].dot(row['llm_vector']), axis=1)

In [21]:
df['dot_product'].quantile(0.75)

31.67430877685547
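For reference, the value that quantile(0.75) reports is, by default in pandas, computed with linear interpolation between the sorted values. The rule can be sketched with the standard library only; the helper name quantile_linear and the toy scores below are assumptions made for illustration, not values from the dataset:

```python
def quantile_linear(values, q):
    """Quantile with linear interpolation between the two nearest sorted values."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q          # fractional position in the sorted list
    lo = int(pos)                    # index of the value just below the position
    hi = min(lo + 1, len(xs) - 1)    # index of the value just above (clamped)
    frac = pos - lo                  # how far we are between the two values
    return xs[lo] + (xs[hi] - xs[lo]) * frac


# Toy scores (hypothetical): the 75th percentile falls a quarter of the way
# between 30 and 40.
toy_scores = [40, 10, 30, 20]
p75 = quantile_linear(toy_scores, 0.75)
print(p75)
```

So a 75th percentile of about 31.67 means that roughly three quarters of the 300 answer pairs have a dot product at or below that value.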

## Q3. Computing the cosine
From Q2, we can see that the scores are not within the [-1, 1] range that cosine similarity takes. This is because the vectors coming from this model are not normalized.

So we need to normalize them.

To do this, we:

* Compute the norm of a vector
* Divide each element by this norm

In [22]:
def normalize_vector(v):
    norm = np.sqrt((v * v).sum())
    return v / norm

In [23]:
df['orig_vector_norm'] = df['orig_vector'].apply(normalize_vector)

In [24]:
df['llm_vector_norm'] = df['llm_vector'].apply(normalize_vector)

In [25]:
df['dot_product_norm'] = df.apply(lambda row: row['orig_vector_norm'].dot(row['llm_vector_norm']), axis=1)

In [27]:
df['dot_product_norm'].quantile(0.75)

0.8362348973751068
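As a sanity check on the idea, the dot product of two normalized vectors is their cosine similarity, and it does not change when a vector is rescaled. Here is a small dependency-free sketch with made-up 2-dimensional vectors standing in for the real embeddings; the helper names normalize and dot are assumptions made for illustration:

```python
import math


def normalize(v):
    """Divide each element by the vector's Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy vectors (hypothetical), each with norm 5.
a = [3.0, 4.0]
b = [4.0, 3.0]

raw = dot(a, b)                              # unbounded, depends on vector length
cosine = dot(normalize(a), normalize(b))     # bounded, depends only on direction

# Scaling a vector changes the raw dot product but not the cosine.
a_scaled = [10 * x for x in a]
raw_scaled = dot(a_scaled, b)
cosine_scaled = dot(normalize(a_scaled), normalize(b))

print(raw, cosine, raw_scaled, cosine_scaled)
```

This is why the normalized scores in this section are comparable across answer pairs, while the raw dot products from Q2 also reflect the lengths of the vectors.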

## Q4. Rouge
Now we will explore an alternative metric: the ROUGE score.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

Because it compares the actual words rather than the embeddings, it can give a more nuanced view of text similarity than cosine similarity alone.
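To make the n-gram overlap idea concrete, here is a simplified sketch of ROUGE-N written by hand. The function names, the whitespace tokenization, and the example sentences are assumptions made for illustration; a real package may tokenize, count, and smooth differently, so its numbers can differ slightly from this sketch.

```python
from collections import Counter


def ngrams(text, n):
    """Count the n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(hypothesis, reference, n=1):
    """Precision, recall and F1 of the n-gram overlap between two texts."""
    hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
    overlap = sum((hyp & ref).values())        # n-grams shared by both texts
    p = overlap / max(sum(hyp.values()), 1)    # share of hypothesis n-grams matched
    r = overlap / max(sum(ref.values()), 1)    # share of reference n-grams matched
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"r": r, "p": p, "f": f}


# Hypothetical sentences: 3 of 4 unigrams and 2 of 3 bigrams overlap.
hyp_text = "the sessions are recorded"
ref_text = "all sessions are recorded"
rouge_1 = rouge_n(hyp_text, ref_text, n=1)
rouge_2 = rouge_n(hyp_text, ref_text, n=2)
print(rouge_1, rouge_2)
```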

We don't need to implement it ourselves; there's a Python package for it:

In [28]:
# !pip install rouge


Let's compute the ROUGE score between the answers at index 10 of our dataframe (doc_id=5170565b):

In [29]:
from rouge import Rouge
rouge_scorer = Rouge()

In [30]:
r = df.iloc[10]
r

answer_llm          Yes, all sessions are recorded, so if you miss...
answer_orig         Everything is recorded, so you won’t miss anyt...
document                                                     5170565b
question                         Are sessions recorded if I miss one?
course                                      machine-learning-zoomcamp
orig_vector         [-0.22097382, -0.07662514, -0.19240223, -0.038...
llm_vector          [-0.10797262, -0.07068468, -0.091208436, 0.092...
dot_product                                                 32.344711
orig_vector_norm    [-0.03465839, -0.012018184, -0.030177113, -0.0...
llm_vector_norm     [-0.016557612, -0.010839502, -0.013986822, 0.0...
dot_product_norm                                             0.777956
Name: 10, dtype: object

In [31]:
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
scores

{'rouge-1': {'r': 0.45454545454545453,
  'p': 0.45454545454545453,
  'f': 0.45454544954545456},
 'rouge-2': {'r': 0.21621621621621623,
  'p': 0.21621621621621623,
  'f': 0.21621621121621637},
 'rouge-l': {'r': 0.3939393939393939,
  'p': 0.3939393939393939,
  'f': 0.393939388939394}}

What's the F score for rouge-1?

In [33]:
scores['rouge-1']['f']

0.45454544954545456

## Q5. Average rouge score
Let's compute the average F-score across rouge-1, rouge-2 and rouge-l for the same record from Q4.

In [35]:
f_scores = []
for key in scores:
    f_scores.append(scores[key]['f'])

In [36]:
avg_f_scores = sum(f_scores) / len(f_scores)

In [37]:
avg_f_scores

0.35490034990035496
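The same average can be written more compactly as a single expression over the nested dictionary. A minimal sketch, with the three F-scores copied from the Q4 output so that it runs without the rouge package; the variable name avg_f is an assumption:

```python
# F-scores copied from the Q4 output (only the 'f' entries are kept here).
scores = {
    "rouge-1": {"f": 0.45454544954545456},
    "rouge-2": {"f": 0.21621621121621637},
    "rouge-l": {"f": 0.393939388939394},
}

# Average the F-score of every ROUGE variant in one pass.
avg_f = sum(metric["f"] for metric in scores.values()) / len(scores)
print(avg_f)
```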

## Q6. Average rouge score for all the data points
Now let's compute the score for all the records and create a dataframe from them.

What's the average rouge_2 across all the records?

In [39]:
df['rouge_2_f'] = df.apply(lambda row: rouge_scorer.get_scores(row['answer_llm'], row['answer_orig'])[0]['rouge-2']['f'], axis=1)

In [40]:
np.mean(df['rouge_2_f'])

0.20696501983423318
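To build a dataframe with all three ROUGE variants, and not only rouge-2, one option is to flatten each nested score dictionary into one row per record. The sketch below uses made-up score dictionaries in the shape shown in Q4; the helper name flatten_scores and the underscore column names are assumptions made for illustration:

```python
# Hypothetical per-record scores, in the nested shape shown in Q4.
records = [
    {"rouge-1": {"r": 0.5, "p": 0.5, "f": 0.5},
     "rouge-2": {"r": 0.25, "p": 0.25, "f": 0.25},
     "rouge-l": {"r": 0.5, "p": 0.5, "f": 0.5}},
    {"rouge-1": {"r": 1.0, "p": 1.0, "f": 1.0},
     "rouge-2": {"r": 0.75, "p": 0.75, "f": 0.75},
     "rouge-l": {"r": 1.0, "p": 1.0, "f": 1.0}},
]


def flatten_scores(scores):
    """Keep only the F-score of each variant, e.g. 'rouge-2' -> 'rouge_2'."""
    return {key.replace("-", "_"): value["f"] for key, value in scores.items()}


# One flat dict per record; a list like this can be passed to pd.DataFrame.
rows = [flatten_scores(s) for s in records]

# Column-wise average, here for rouge_2 only.
mean_rouge_2 = sum(row["rouge_2"] for row in rows) / len(rows)
print(rows[0], mean_rouge_2)
```

With the real data, each element of records would come from calling the scorer on one answer pair, and the averages for rouge_1, rouge_2 and rouge_l could then be read from the resulting dataframe.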