## Homework: Evaluation and Monitoring

In this homework, we'll evaluate the quality of our RAG system.

> It's possible that your answers won't match exactly. If that's the case, select the closest one.

In [1]:
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from rouge import Rouge


## Getting the data

Let's start by getting the dataset. We will use the data we generated in the module.

In particular, we'll evaluate the quality of our RAG system
with [gpt-4o-mini](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv).


Read it:

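```python
# The blob URL linked above; appending ?raw=1 makes GitHub serve the raw CSV
github_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv'
url = f'{github_url}?raw=1'
df = pd.read_csv(url)
```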

We will use only the first 300 documents:


```python
df = df.iloc[:300]
```

In [2]:
df = pd.read_csv("results-gpt4o-mini.csv")
df = df.iloc[:300]

## Q1. Getting the embeddings model

Now, get the embeddings model `multi-qa-mpnet-base-dot-v1` from
[the Sentence Transformer library](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview)

> Note: this is not the same model as in HW3

```python
from sentence_transformers import SentenceTransformer

model_name = 'multi-qa-mpnet-base-dot-v1'
embedding_model = SentenceTransformer(model_name)
```

Create the embeddings for the first LLM answer:

```python
answer_llm = df.iloc[0].answer_llm
```

What's the first value of the resulting vector?

In [3]:
embedding_model = SentenceTransformer('multi-qa-mpnet-base-dot-v1', device="cuda")
answer_llm = df.iloc[0].answer_llm
embedding_vector = embedding_model.encode(answer_llm)
print(f"The first value of the resulting vector is {embedding_vector[0]}")



The first value of the resulting vector is -0.4224467873573303


## Q2. Computing the dot product


Now, for each answer pair (`answer_llm`, `answer_orig`), let's create embeddings and compute the dot product between them.

We will put the results (scores) into the `evaluations` list.

What's the 75th percentile of the score?
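
As a minimal sketch of the per-pair computation, assuming two small made-up matrices in place of real embeddings (`emb_llm` and `emb_orig` are hypothetical names, one row per answer), the scores and their percentile could be computed like this:

```python
import numpy as np

# Toy stand-ins for the two embedding matrices (hypothetical values, one row per answer)
emb_llm = np.array([[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]])
emb_orig = np.array([[1.0, 0.0, 1.0], [2.0, 3.0, 4.0]])

# One dot product per (LLM answer, original answer) pair
evaluations = [float(np.dot(a, b)) for a, b in zip(emb_llm, emb_orig)]

# Equivalent vectorized form: multiply row by row, then sum over each row
vectorized = (emb_llm * emb_orig).sum(axis=1)

# 75th percentile of the scores (NumPy interpolates linearly by default)
p75 = np.percentile(evaluations, 75)
print(evaluations, p75)
```

The loop mirrors the wording of the task, while the vectorized form should give the same numbers without a Python-level loop.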

In [4]:
answer_llm = df["answer_llm"]
answer_orig = df["answer_orig"]
embeddings1 = embedding_model.encode(answer_llm)
embeddings2 = embedding_model.encode(answer_orig)
dot_products = np.einsum('ij,ij->i', embeddings1, embeddings2)
print(f"The 75% percentile of the score is {np.percentile(dot_products, 75)}")

The 75% percentile of the score is 31.674309253692627


## Q3. Computing the cosine

From Q2, we can see that the results are not within the [0, 1] range. It's because the vectors coming from this model are not normalized.

So we need to normalize them.

To do this, we:

* Compute the norm of a vector
* Divide each element by this norm

So, for vector `v`, it'll be `v / ||v||`.

In numpy, this is how you do it:

```python
norm = np.sqrt((v * v).sum())
v_norm = v / norm
```

Let's put it into a function and then compute the dot product
between normalized vectors. This will give us the cosine similarity.

What's the 75th percentile of the cosine scores?
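
One possible way to wrap the normalization in a function is sketched below. The names `normalize` and `cosine` are illustrative choices, and the sketch assumes non-zero vectors, since a zero vector has norm 0 and cannot be normalized:

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length (assumes v is not the zero vector)."""
    norm = np.sqrt((v * v).sum())
    return v / norm

def cosine(u, v):
    """Cosine similarity: dot product of the two normalized vectors."""
    return float(np.dot(normalize(u), normalize(v)))

# Toy vectors, both with norm 5
u = np.array([3.0, 4.0])
v = np.array([4.0, 3.0])
print(cosine(u, v))  # (3*4 + 4*3) / (5 * 5) = 0.96
```

After normalization the score no longer depends on vector length, only on the angle between the two vectors.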

In [5]:
embeddings1 = embeddings1 / np.linalg.norm(embeddings1, axis=1).reshape(-1, 1)
embeddings2 = embeddings2 / np.linalg.norm(embeddings2, axis=1).reshape(-1, 1)
dot_products = np.einsum('ij,ij->i', embeddings1, embeddings2)
print(f"The 75% cosine in the scores is {np.percentile(dot_products, 75)}")

The 75% cosine in the scores is 0.8362348079681396


## Q4. Rouge

Now we will explore an alternative metric - the ROUGE score.  

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than just cosine similarity alone.

We don't need to implement it ourselves; there's a Python package for it:

```bash
pip install rouge
```

(The latest version at the moment of writing is `1.0.1`)

Let's compute the ROUGE score between the answers at index 10 of our dataframe (`doc_id=5170565b`):

```python
from rouge import Rouge
rouge_scorer = Rouge()

r = df.iloc[10]  # the record at index 10
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
```

There are three scores: `rouge-1`, `rouge-2` and `rouge-l`, and precision, recall and F1 score for each.

* `rouge-1` - the overlap of unigrams,
* `rouge-2` - bigrams,
* `rouge-l` - the longest common subsequence
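
To make the unigram case concrete, here is a simplified sketch of a ROUGE-1 style score using whitespace tokens and clipped counts. It is an illustration of the idea only, not the `rouge` package's implementation, so its numbers may differ from what the package returns; `rouge_1_sketch` is a hypothetical helper name.

```python
from collections import Counter

def rouge_1_sketch(hypothesis, reference):
    """Unigram-overlap precision, recall and F1 (simplified illustration)."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    # Each shared word counts at most as often as it appears in both texts
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return {"p": 0.0, "r": 0.0, "f": 0.0}
    p = overlap / sum(hyp.values())  # share of hypothesis words found in the reference
    r = overlap / sum(ref.values())  # share of reference words found in the hypothesis
    return {"p": p, "r": r, "f": 2 * p * r / (p + r)}

toy = rouge_1_sketch("the cat sat on the mat", "the cat lay on a mat")
print(toy)  # 4 shared unigrams out of 6 on each side
```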

What's the F score for `rouge-1`?

In [6]:
rouge_scorer = Rouge()
r = df.iloc[10]  # score only the record at index 10 instead of all 300
score = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
print(f"The F score for 'rouge-1' is {score['rouge-1']['f']}")

The F score for 'rouge-1' is 0.45454544954545456


## Q5. Average rouge score

Let's compute the average of the `rouge-1`, `rouge-2` and `rouge-l` F-scores for the same record from Q4.

In [7]:
avg_score = np.mean([score["rouge-1"]["f"], score["rouge-2"]["f"], score["rouge-l"]["f"]])
print(f"The average between rouge-1, rouge-2 and rouge-l for the same record from Q4 is {avg_score}")

The average between rouge-1, rouge-2 and rouge-l for the same record from Q4 is 0.35490034990035496


## Q6. Average rouge score for all the data points

Now let's compute the F-score for all the records and create a dataframe from them.
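
A rough sketch of the dataframe step, assuming a list of score dictionaries shaped like the output of `get_scores` (the values in `toy_scores` are made up, and the column names `rouge_1`, `rouge_2`, `rouge_l` are one possible naming):

```python
import pandas as pd

# Made-up scores in the same nested shape as the package output (F-scores only)
toy_scores = [
    {"rouge-1": {"f": 0.5}, "rouge-2": {"f": 0.25}, "rouge-l": {"f": 0.5}},
    {"rouge-1": {"f": 1.0}, "rouge-2": {"f": 0.75}, "rouge-l": {"f": 1.0}},
]

# One row per record, one column per metric, keeping only the F-score
rows = [
    {name.replace("-", "_"): values["f"] for name, values in record.items()}
    for record in toy_scores
]
df_rouge = pd.DataFrame(rows)

mean_rouge_2 = df_rouge["rouge_2"].mean()
print(df_rouge)
print(mean_rouge_2)
```

Keeping the scores in a dataframe makes it easy to inspect every metric at once, for example with `df_rouge.describe()`.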

What's the average F-score in `rouge_2` across all the records?

In [8]:
scores = rouge_scorer.get_scores(df['answer_llm'], df['answer_orig'])
rouge_2 = [score["rouge-2"]["f"] for score in scores]
print(f"The average rouge_2 across all the records is {np.mean(rouge_2)}")

The average rouge_2 across all the records is 0.20696501983423318


## Submit the results

* Submit your results here: https://courses.datatalks.club/llm-zoomcamp-2024/homework/hw4
* It's possible that your answers won't match exactly. If that's the case, select the closest one.