# LLM Zoomcamp 2024 - Session #4 - Homework

Author: José Victor

## Getting the data

Let's start by getting the dataset. We will use the data we generated in the module.

In particular, we'll evaluate the quality of our RAG system with [gpt-4o-mini](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv).

Read it:
```python
url = f'{github_url}?raw=1'
df = pd.read_csv(url)
```

We will use only the first 300 documents:
```python
df = df.iloc[:300]
```

In [1]:
import numpy as np
import pandas as pd

## Q1. Getting the embeddings model

Now, get the embeddings model `multi-qa-mpnet-base-dot-v1` from [the Sentence Transformer library](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview).

* Note: this is not the same model as in HW3

```python
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer(model_name)
```

Create the embeddings for the first LLM answer:
```python
answer_llm = df.iloc[0].answer_llm
```

What's the first value of the resulting vector?

* (X) -0.42
* ( ) -0.22
* ( ) -0.02
* ( ) 0.21

In [2]:
github_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv'
url = f'{github_url}?raw=1'
df = pd.read_csv(url)
df = df.iloc[:300]

In [3]:
from sentence_transformers import SentenceTransformer

model_name = 'multi-qa-mpnet-base-dot-v1'
embedding_model = SentenceTransformer(model_name)

You try to use a model that was created with version 3.0.0.dev0, however, your version is 2.7.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.

In [4]:
answer_llm = df.iloc[0].answer_llm
v_llm = embedding_model.encode(answer_llm)
print(v_llm[0])

-0.42244655


## Q2. Computing the dot product

Now, for each answer pair, let's create embeddings and compute the dot product between them.

We will put the results (scores) into the `evaluations` list.

What's the 75th percentile of the score?

* ( ) 21.67
* (X) 31.67
* ( ) 41.67
* ( ) 51.67

In [5]:
embeddings_llm = [embedding_model.encode(df.iloc[i].answer_llm) for i in range(df.shape[0])]
embeddings_orig = [embedding_model.encode(df.iloc[i].answer_orig) for i in range(df.shape[0])]

In [6]:
evaluations = [v_llm.dot(v_orig) for v_llm, v_orig in zip(embeddings_llm, embeddings_orig)]
print(np.percentile(evaluations, 75))

31.67430877685547
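
The per-pair loop above can also be written as one row-wise operation once the embeddings are stacked into two matrices. The following is a minimal sketch on small made-up arrays rather than real embeddings; `emb_llm` and `emb_orig` are hypothetical stand-ins for the stacked vectors.

```python
import numpy as np

# Toy stand-ins for stacked embeddings (3 pairs, 4 dimensions each).
# In the notebook these could be built with np.stack(embeddings_llm) and
# np.stack(embeddings_orig); the values below are made up.
emb_llm = np.array([[1.0, 2.0, 0.0, 1.0],
                    [0.5, 0.5, 0.5, 0.5],
                    [3.0, 0.0, 1.0, 2.0]])
emb_orig = np.array([[2.0, 1.0, 1.0, 0.0],
                     [1.0, 1.0, 1.0, 1.0],
                     [1.0, 4.0, 0.0, 1.0]])

# Row-wise dot product: multiply element by element, then sum each row.
scores = (emb_llm * emb_orig).sum(axis=1)

# The explicit loop over pairs gives the same values.
scores_loop = np.array([a.dot(b) for a, b in zip(emb_llm, emb_orig)])

print(scores)
print(np.percentile(scores, 75))
```

On 300 records the difference in speed is small, so the loop version is just as reasonable here; the stacked form mainly keeps the scores in a single array.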


## Q3. Computing the cosine

From Q2, we can see that the results are not confined to a fixed range: cosine similarity always lies in [-1, 1], but these dot products are much larger. This is because the vectors coming from this model are not normalized.

So we need to normalize them.

To do that, we:

* Compute the norm of a vector
* Divide each element by this norm

So, for vector `v`, it'll be `v / ||v||`

In numpy, this is how you do it:

```python
norm = np.sqrt((v * v).sum())
v_norm = v / norm
```

Let's put it into a function and then compute the dot product between the normalized vectors. This gives us the cosine similarity.

What's the 75th percentile of the cosine scores?

* ( ) 0.63
* ( ) 0.73
* (X) 0.83
* ( ) 0.93

In [17]:
embeddings_llm_normalized = [array/(np.sqrt((array*array).sum())) for array in embeddings_llm]
embeddings_orig_normalized = [array/(np.sqrt((array*array).sum())) for array in embeddings_orig]

In [18]:
evaluations = [v_llm.dot(v_orig) for v_llm, v_orig in zip(embeddings_llm_normalized, embeddings_orig_normalized)]
print(np.percentile(evaluations, 75))

0.8362348973751068
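
The normalization described in this section can be wrapped in a small helper. This is a minimal sketch, assuming non-zero input vectors; `normalize` and `cosine` are illustrative names, and the two vectors are made up rather than taken from the model.

```python
import numpy as np

def normalize(v):
    """Return v scaled to unit length, i.e. v / ||v||."""
    norm = np.sqrt((v * v).sum())
    return v / norm

def cosine(u, v):
    """Cosine similarity as the dot product of the normalized vectors."""
    return normalize(u).dot(normalize(v))

# Made-up vectors, not real embeddings. Both have norm 5.
u = np.array([3.0, 4.0])
v = np.array([4.0, 3.0])

print(u.dot(v))      # raw dot product, depends on the vector lengths
print(cosine(u, v))  # bounded by [-1, 1], roughly 24 / (5 * 5)
```

Applied to the notebook's lists, the same helper would replace the two repeated list comprehensions, for example `[normalize(v) for v in embeddings_llm]`.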


## Q4. Rouge

Now we will explore an alternative metric - the ROUGE score.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than cosine similarity alone.

We don't need to implement it ourselves; there's a Python package for it:

```bash
pip install rouge
```
(The latest version at the moment of writing is `1.0.1`.)

Let's compute the ROUGE score between the answers at index 10 of our dataframe (`doc_id=5170565b`):

```python
from rouge import Rouge
rouge_scorer = Rouge()

# the record at index 10 of the dataframe
r = df.iloc[10]
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
```

There are three scores: `rouge-1`, `rouge-2` and `rouge-l`, and precision, recall and F1 score for each.

* `rouge-1` - the overlap of unigrams,
* `rouge-2` - bigrams,
* `rouge-l` - the longest common subsequence
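
To make the n-gram overlap idea concrete, here is a simplified sketch of precision, recall and F1 for unigrams and bigrams. It is an illustration only, not the `rouge` package's implementation: it assumes whitespace tokenization and clipped counts, so its numbers may differ from the package's. The function names and example sentences are made up.

```python
from collections import Counter

def ngrams(text, n):
    """Split on whitespace and count the n-grams of the text."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_overlap_scores(hypothesis, reference, n=1):
    """Precision, recall and F1 of the n-gram overlap (simplified ROUGE-N)."""
    hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((hyp & ref).values())
    hyp_total, ref_total = sum(hyp.values()), sum(ref.values())
    p = overlap / hyp_total if hyp_total else 0.0
    r = overlap / ref_total if ref_total else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {'p': p, 'r': r, 'f': f}

hypothesis = "the cat sat on the mat"
reference = "the cat is on the mat"

# 5 of 6 unigrams match; 3 of 5 bigrams match.
print(ngram_overlap_scores(hypothesis, reference, n=1))
print(ngram_overlap_scores(hypothesis, reference, n=2))
```

`rouge-l` works differently: it uses the longest common subsequence instead of fixed-size n-grams, so it is not covered by this sketch.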

What's the F score for `rouge-1`?

* ( ) 0.35
* (X) 0.45
* ( ) 0.55
* ( ) 0.65

In [19]:
from rouge import Rouge
rouge_scorer = Rouge()

r = df.iloc[10]
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]

In [21]:
scores['rouge-1']['f']

0.45454544954545456

## Q5. Average rouge score

Let's compute the average of the F1 scores for `rouge-1`, `rouge-2` and `rouge-l` for the same record from Q4.

* (X) 0.35
* ( ) 0.45
* ( ) 0.55
* ( ) 0.65

In [22]:
rouge_avg_q4 = np.mean([scores[score_type]['f'] for score_type in ['rouge-1', 'rouge-2', 'rouge-l']])
print(rouge_avg_q4)

0.35490034990035496


## Q6. Average rouge score for all the data points

Now let's compute the scores for all the records:

```python
rouge_1 = scores['rouge-1']['f']
rouge_2 = scores['rouge-2']['f']
rouge_l = scores['rouge-l']['f']
rouge_avg = (rouge_1 + rouge_2 + rouge_l) / 3
```

Then create a dataframe from them.

What's the average `rouge_2` across all the records?

* ( ) 0.10
* (X) 0.20
* ( ) 0.30
* ( ) 0.40

In [24]:
data = []
columns = ["rouge_1", "rouge_2", "rouge_l", "rouge_avg"]

for i in range(df.shape[0]):
    r = df.iloc[i]
    scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
    aux = [scores[score_type]['f'] for score_type in ['rouge-1', 'rouge-2', 'rouge-l']]
    aux.append(np.mean(aux))
    data.append(aux)

df_scores = pd.DataFrame(data=data, columns=columns)
print(df_scores.rouge_2.mean())

0.20696501983423318
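
An alternative way to organize the loop is to turn each nested score dict into one flat row. The sketch below assumes the nested shape used in this notebook (`scores['rouge-1']['f']` and so on); `to_row` is a hypothetical helper, and the score values are made up, not real results.

```python
from statistics import mean

# Made-up scores in the same nested shape as the scorer's output.
all_scores = [
    {'rouge-1': {'f': 0.6}, 'rouge-2': {'f': 0.3}, 'rouge-l': {'f': 0.6}},
    {'rouge-1': {'f': 0.4}, 'rouge-2': {'f': 0.1}, 'rouge-l': {'f': 0.4}},
]

def to_row(scores):
    """Flatten one nested score dict into a row with an average column."""
    row = {
        'rouge_1': scores['rouge-1']['f'],
        'rouge_2': scores['rouge-2']['f'],
        'rouge_l': scores['rouge-l']['f'],
    }
    row['rouge_avg'] = (row['rouge_1'] + row['rouge_2'] + row['rouge_l']) / 3
    return row

rows = [to_row(s) for s in all_scores]

# Average rouge_2 across the toy records.
print(mean(row['rouge_2'] for row in rows))
```

A list of dicts like `rows` can be passed directly to `pd.DataFrame(rows)`, which takes the column names from the keys.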
