# **02 - LLM Evaluation Part 1**

In this tutorial, we'll take a **practical look at the various metrics we've seen in the course**.  
We'll be looking at:
- traditional, 
- embedding, 
- trained,
- LLM-assisted metrics.

Optionally, we'll then see how to **implement a validation loop during training to compute perplexity** (file: `02_Evaluation_part2.ipynb`). To do this, we'll once again use the **Phi-2 Decoder model** with the same data (roleplay).

**Uncomment the following cell on Jean-Zay only** (no internet access)

In [None]:
# import os

# cache_path = os.environ['WORK'] + "/cache_spellm"
# os.environ['TRANSFORMERS_CACHE'] = cache_path
# os.environ['HF_HOME'] = cache_path
# os.environ['HF_DATASETS_CACHE'] = cache_path
# os.environ['TORCH_HOME'] = cache_path

In [None]:
import os
import random
import json
from pathlib import Path

import datasets
import pandas as pd
import torch
from bert_score import BERTScorer
from detoxify import Detoxify
from sentence_transformers import SentenceTransformer, util
from torchmetrics.text import BLEUScore, CHRFScore, ROUGEScore
from transformers import (AutoModelForCausalLM, AutoModelForSeq2SeqLM,
                          AutoModelForSequenceClassification, AutoTokenizer)
from langchain.prompts import PromptTemplate
from vllm import LLM, SamplingParams
from utils import seed_everything
from jupyterquiz import display_quiz


DSDIR = Path(os.environ["DSDIR"])
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
pd.set_option('display.max_colwidth', None)  # or 199
seed_everything(53)

---
## **Discover metrics in practice**

We'll look at a few examples of several of the metrics presented in the course.

We will use the example seen in the course and look at the similarity of the reference with several candidates.

In [None]:
# the reference sentence
reference = "Bud Powell was a legendary pianist."

In [None]:
# candidate sentences
candidate_1 = "Bud Powell was a legendary pianist."
candidate_2 = "Bud Powekl was a legendary pianist."                    # a small spelling error: Powekl
candidate_3 = "Bud Powell was a historic piano player."
candidate_4 = "Legendary in the realm of pianists was Bud Powell."
candidate_5 = "Bud Powell, the American, was a legendary pianist."
candidate_6 = "Bud Powell was a New Yorker."
candidate_7 = "Bud Powell, the French, was a legendary pianist."
candidate_8 = "Jimi Hendrix was an American guitarist."                # almost nothing to do with the reference, but about music and an American

candidates = [candidate_1, candidate_2, candidate_3, candidate_4, candidate_5, candidate_6, candidate_7, candidate_8]
references = [reference] * len(candidates)

The candidates are listed in **decreasing order of semantic similarity to the reference, according to human judgement** (Thomas's).

If you have any ideas in mind, **don't hesitate to add examples to this list**.

### **Traditional metrics**

#### **BLEU, ROUGE, chrF**

First, we will use the `BLEU`, `ROUGE` and `chrF` scores. Good implementations of these metrics are available in `torchmetrics`.

In [None]:
# Loading metrics on GPU
bleu_metric = BLEUScore().to("cuda")
rouge_metric = ROUGEScore().to("cuda")
chrf_metric = CHRFScore().to("cuda")

traditional_metrics = [bleu_metric, rouge_metric, chrf_metric]

Let's calculate the scores using these metrics.

The `torchmetrics` metrics take a list of candidates and a list of references as parameters and return a single score for the whole batch. We want one score per candidate, so we'll pass the candidates to the metric one at a time.  
Each candidate can also be compared with several references, which is why the references are passed as a list of lists.

Here's an example with BLEU.

In [None]:
# bleu_metric(candidates, references)
score = bleu_metric([candidate_1], [[reference]])
score.item()  # item() retrieves the value inside the returned tensor

Candidate 1 and the reference are identical, so the score is 1. 
We're going to loop over all the examples, compute every score and fill in a dataframe.

Furthermore, for ROUGE, the implementation returns several scores because the metric has several variants. We'll use **ROUGE-L Recall** here. The metric looks for the longest common subsequence (words that appear in the same order, not necessarily contiguously) between the candidate and the reference, and divides its length by the number of words in the reference.

In [None]:
rouge_metric([candidate_1], [[reference]])
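
To make the ROUGE-L recall definition concrete, here is a minimal pure-Python sketch. It is not the `torchmetrics` implementation: the `tokenize` helper (lowercasing, punctuation dropped) is an assumption made for illustration, so values may differ slightly from the library's.

```python
import re


def tokenize(text):
    # Assumption of this sketch: lowercase and keep alphanumeric runs only
    return re.findall(r"[a-z0-9]+", text.lower())


def lcs_length(a, b):
    # Dynamic programming: dp[i][j] is the LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l_recall(candidate, reference):
    # Length of the longest common subsequence divided by the reference length
    cand, ref = tokenize(candidate), tokenize(reference)
    return lcs_length(cand, ref) / len(ref) if ref else 0.0


reference = "Bud Powell was a legendary pianist."
print(rouge_l_recall("Bud Powell, the French, was a legendary pianist.", reference))
print(rouge_l_recall("Bud Powell was a New Yorker.", reference))
```

In this sketch, words inserted in the middle of the candidate do not break the subsequence, so they have no effect on recall.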

The function below computes the scores for all the examples and returns them in a DataFrame. Analyse the results to make sure you have understood these metrics.

In [None]:
def compute_scores_traditional_metrics(reference, candidates, metrics):
    """Score each candidate against the reference with every metric (one row per candidate)."""
    rows = []
    for candidate in candidates:
        row = [candidate]
        for metric in metrics:
            score = metric([candidate], [[reference]])
            # ROUGEScore returns a dict of variants: we keep ROUGE-L recall
            row.append(score['rougeL_recall'].item() if isinstance(metric, ROUGEScore) else score.item())
        rows.append(row)

    return pd.DataFrame(rows, columns=["Candidate"] + [type(metric).__name__ for metric in metrics])

Let's generate the results of the metrics on our examples.

In [None]:
results_df = compute_scores_traditional_metrics(reference, candidates, traditional_metrics)
results_df

We recover a result seen in the course for the BLEU score: `Bud Powell was a New Yorker.` is rated more similar than `Bud Powell was a historic piano player`.
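
The BLEU ranking above can be reproduced by hand with a minimal sentence-level sketch: clipped n-gram precisions up to 4-grams, a geometric mean, and a brevity penalty. Whitespace tokenisation and the absence of smoothing are assumptions of this sketch, and the function `sentence_bleu` is illustrative rather than the `torchmetrics` code.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    # Assumption of this sketch: tokens are obtained by splitting on whitespace
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference
        matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # no smoothing: a single order without matches zeroes the score
        log_precisions.append(math.log(matches / total))
    # The brevity penalty discourages candidates shorter than the reference
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


reference = "Bud Powell was a legendary pianist."
for candidate in ["Bud Powell was a New Yorker.",
                  "Bud Powell was a historic piano player.",
                  "Legendary in the realm of pianists was Bud Powell."]:
    print(f"{sentence_bleu(candidate, reference):.3f}  {candidate}")
```

Both of the first two candidates share the 4-gram `Bud Powell was a` with the reference, but the longer candidate spreads the same matches over more n-grams, which lowers every precision. The reordered candidate shares no bigram with the reference, so its score drops to 0.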

You can also see that chrF is tolerant of spelling mistakes.
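
To see why a character-level metric tolerates a typo, here is a simplified chrF sketch: character n-gram precision and recall averaged over orders 1 to 6, combined into an F-score with beta = 2. It ignores whitespace and leaves out the word n-grams that some chrF variants add, so it is an approximation for illustration and not the `torchmetrics` implementation.

```python
from collections import Counter


def char_ngrams(text, n):
    chars = text.replace(" ", "")  # assumption of this sketch: whitespace is ignored
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(candidate, reference, max_n=6, beta=2.0):
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = char_ngrams(candidate, n), char_ngrams(reference, n)
        if not cand_counts or not ref_counts:
            continue  # one of the strings is shorter than n characters
        matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        precisions.append(matches / sum(cand_counts.values()))
        recalls.append(matches / sum(ref_counts.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    # F-beta with beta = 2 gives recall more weight than precision
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


reference = "Bud Powell was a legendary pianist."
typo_score = simple_chrf("Bud Powekl was a legendary pianist.", reference)
print(f"{typo_score:.3f}")
```

A single wrong character only affects the few character n-grams that contain it, whereas at word level the whole token `Powekl` fails to match.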

---
**A few questions about these results.**

- The ROUGE score may seem more informative than the BLEU score on these examples. However, it gives both `Bud Powell, the American, was a legendary pianist.` and `Bud Powell, the French, was a legendary pianist.` a score of 1, even though the second candidate is factually wrong: Bud Powell was not French. Why do you think that is?

<details>
<summary><b>Answer</b></summary>
ROUGE is recall-oriented: it checks that the content of the reference is present in the candidate. With ROUGE-L, the longest common subsequence of the two sentences is `Bud Powell was a legendary pianist`, which is the whole reference. Recall does not penalise the extra words in the candidate, so both candidates are considered perfect.
</details>

- Why do you think none of the metrics do well on the example `Legendary in the realm of pianists was Bud Powell`?

<details>
<summary><b>Answer</b></summary>
Although this example is semantically very close to the reference, it shares too few n-grams with it. This is the main limitation of traditional metrics.
</details>

### **Embedding metrics**

#### **BERTScore**

Let's now see what **BERTScore** makes of these examples, in the hope that it agrees better with human judgement.  
A BERTScore implementation is available in `torchmetrics`, but it doesn't seem to be very accurate, so we will use the official implementation (the `bert_score` package).

The underlying model will be `deberta-large-mnli`, whose scores have a Pearson correlation of 77% with human judgement on translation tasks. For comparison, `bert-base-uncased` has a correlation of 69%. The list of available models and their correlations is available here: https://docs.google.com/spreadsheets/d/1RKOVpselB98Nnh_EOC4A2BYn8_201tmPODpNWu4w7xI/edit#gid=0. The spreadsheet also has a `Max Length` column giving the maximum input length; anything beyond it is truncated. Depending on your task, you may need to pay attention to this parameter. The same applies to Sentence Transformers.

In [None]:
# Initialize BERTScorer
bert_scorer = BERTScorer(model_type="microsoft/deberta-large-mnli")  # deberta-large-mnli pearson correlation: 0.7736

# Compute BERTScore on our examples
P, R, F1 = bert_scorer.score(candidates, references)

# Add results to our dataframe
results_df['BERT Precision'] = P
results_df['BERT Recall'] = R
results_df['BERT F1'] = F1

In [None]:
results_df

What are your thoughts on this? 
Although the results are not perfect, they seem **more consistent with human judgement**. The candidate `Bud Powell, the French, was a legendary pianist` is penalised more than before.  

It is the precision that penalises this example. The difference between BERTScore precision and recall will be much clearer when we evaluate our trained model on the use case.
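
The roles of precision and recall can be illustrated with a small sketch of the greedy matching step at the heart of BERTScore. The 2-D vectors below are made-up stand-ins for contextual token embeddings, and the sketch leaves out options of the official package such as IDF weighting and baseline rescaling.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def greedy_bertscore(cand_vecs, ref_vecs):
    # Precision: each candidate token is matched to its most similar reference token
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is matched to its most similar candidate token
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up embeddings, for illustration only
ref_vecs = [[1.0, 0.0], [0.0, 1.0]]
cand_vecs = [[1.0, 0.0], [0.0, 1.0], [0.6, -0.8]]  # same tokens plus one extra token
precision, recall, f1 = greedy_bertscore(cand_vecs, ref_vecs)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Every reference token finds a perfect match, so recall stays at 1, while the extra candidate token has no good counterpart in the reference and pulls precision down.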

### **Trained metrics**

We'll now take a look at a few trained metrics.

#### **SentenceTransformers**

We're now going to use SentenceTransformers. These are models trained to obtain dense representations of sentences. You can then calculate a cosine similarity between two representations, as shown in the next cell.

For the example, we'll use the following model: `mixedbread-ai/mxbai-embed-large-v1`. To choose a Sentence Transformer, we can use the following leaderboard: **https://huggingface.co/spaces/mteb/leaderboard**.  
Information about `mixedbread-ai/mxbai-embed-large-v1`:
- At the time of this model's selection (April 8, 2024), **it was ranked 9th overall for the English language, and 4th for STS** (Semantic Textual Similarity) tasks.  
- **Lightweight (1.34 GB, 335M parameters)**
- **Does not require the use of an external API**, unlike several of the higher-ranked models in this benchmark.  
- **Maximum input length: 512 tokens**. Beyond that, your sentences will be truncated. If you have long examples, other models may be more appropriate.

Note: **the MTEB benchmark has the advantage of being broken down into several tasks and ranking for several languages, including French!** An example with French will follow.

In [None]:
model = SentenceTransformer(str(DSDIR / "HuggingFace_Models/mixedbread-ai/mxbai-embed-large-v1"))

# Compute embedding for both lists: reference and candidates
reference_embedding = model.encode(reference, convert_to_tensor=True)
candidates_embedding = model.encode(candidates, convert_to_tensor=True)
reference_embedding = reference_embedding.expand_as(candidates_embedding)

print(f"Size of our sentence embeddings: {reference_embedding.shape}")

# Compute cosine-similarities
cosine_scores = util.cos_sim(reference_embedding, candidates_embedding)
cosine_scores = cosine_scores[0].cpu().numpy()

results_df['SentenceTransformer'] = cosine_scores
results_df

The results diverge slightly from BERTScore. They are more spread out. BERTScore seems more robust on the typo in our example, while the SentenceTransformer does very well in understanding the meaning behind the sentence: `Legendary in the realm of pianists was Bud Powell`.  
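
For readers who want to see what the cosine similarity computes, here is a minimal pure-Python sketch. The 3-D vectors are made up for illustration (real sentence embeddings have many more dimensions), and `pairwise_cosine` is a hypothetical helper that mimics the all-pairs matrix layout in which entry `[i][j]` compares item `i` of the first list with item `j` of the second.

```python
import math


def cosine(u, v):
    # Dot product divided by the product of the vector norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_cosine(rows, cols):
    # Entry [i][j] compares rows[i] with cols[j]
    return [[cosine(r, c) for c in cols] for r in rows]


# Made-up embeddings, for illustration only
reference_vecs = [[1.0, 0.0, 0.0]]
candidate_vecs = [[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 3.0]]
scores = pairwise_cosine(reference_vecs, candidate_vecs)[0]
print([round(s, 3) for s in scores])
```

The score depends only on the direction of the vectors, not on their length: the first candidate points the same way as the reference and the last one is orthogonal to it.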

Now let's look at a **French example with the SentenceTransformer `mxbai-embed-large-v1` and `sentence-camembert-large`**, which performs very well in French.

In [None]:
french_references = ["Un homme fait un lit.", "Fin des travaux de démolition du Don Valley Stadium", "Chute du pétrole dans le commerce asiatique", "Le nouveau film est génial"]
french_candidates = ["Une femme joue de la guitare.", "Les travaux de démolition du stade vont commencer", "Baisse des prix du pétrole dans le commerce asiatique", "Le nouveau film est vraiment super"]

First, similarity scores are calculated using `mxbai-embed-large-v1`.

In [None]:
model = SentenceTransformer(str(DSDIR / "HuggingFace_Models/mixedbread-ai/mxbai-embed-large-v1"))

# Compute embedding for both lists: reference and candidates
reference_embedding = model.encode(french_references, convert_to_tensor=True)
candidates_embedding = model.encode(french_candidates, convert_to_tensor=True)
print(f"Size of our sentence embeddings: {reference_embedding.shape}")

# Compute cosine-similarities
cosine_scores = util.cos_sim(reference_embedding, candidates_embedding)

# Display results
df_data = {"": french_candidates}  # Build a dictionary with the candidates
for i, ref in enumerate(french_references):
    df_data[ref] = [cosine_scores[i][j].item() for j in range(len(french_candidates))]
df_results = pd.DataFrame(df_data)
df_results

Then with our second model.

In [None]:
model = SentenceTransformer(str(DSDIR / "HuggingFace_Models/dangvantuan/sentence-camembert-large"))

# Compute embedding for both lists: reference and candidates
reference_embedding = model.encode(french_references, convert_to_tensor=True)
candidates_embedding = model.encode(french_candidates, convert_to_tensor=True)

print(f"Size of our sentence embeddings: {reference_embedding.shape}")

# Compute cosine-similarities
cosine_scores = util.cos_sim(reference_embedding, candidates_embedding)

# Display results
df_data = {"": french_candidates}  # Build a dictionary with the candidates
for i, ref in enumerate(french_references):
    df_data[ref] = [cosine_scores[i][j].item() for j in range(len(french_candidates))]
df_results = pd.DataFrame(df_data)
df_results

The results should now look more consistent with human judgement.

#### **Detoxify**

Now let's look at other metrics in practice, using completely different examples.  
Detoxify is a Python module for estimating the toxicity of a sentence.
The metric evaluates several criteria:
- `toxicity`, `severe_toxicity`, `identity_attack`, `obscene`, `insult`, `threat`, `sexual_explicit`

Scores are between 0 and 1. Here is an example of use:

In [None]:
sentences = [
    "Bonjour, comment allez-vous ?",
    "Taisez-vous",
]

detoxifier = Detoxify(
    'multilingual',  # xlm-roberta-base model
    device="cuda")
results = detoxifier.predict(sentences)
results

You can vary the examples and try to find biases. They are rare, but they can happen, especially when you change languages.
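
When testing many sentences, it can help to turn the per-criterion scores into a list of flagged criteria per sentence. The sketch below assumes a dictionary that maps each criterion to one score per sentence, which is the shape returned for a list input; the scores are made up, and both `flag_sentences` and the 0.5 threshold are hypothetical choices for illustration.

```python
def flag_sentences(sentences, scores, threshold=0.5):
    """Return, for each sentence, the criteria whose score exceeds the threshold."""
    flagged = {}
    for index, sentence in enumerate(sentences):
        flagged[sentence] = [label for label, values in scores.items() if values[index] > threshold]
    return flagged


# Made-up scores for two placeholder sentences, for illustration only
example_sentences = ["sentence A", "sentence B"]
example_scores = {
    "toxicity": [0.01, 0.83],
    "insult": [0.00, 0.61],
    "threat": [0.00, 0.02],
}
print(flag_sentences(example_sentences, example_scores))
```

The right threshold depends on the use case, so it is worth checking it against examples you have labelled yourself.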

### **LLM-assisted metrics**

For this practical work, we will use the open-source LLM: **`prometheus-eval/prometheus-7b-v2.0`** based on Mistral-7B. This very recent model was announced in the paper `Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models` on 2 May 2024.  
Paper link: `https://arxiv.org/pdf/2405.01535`

This LLM has been **specially finetuned to carry out evaluations and to be an open alternative to GPT-4**. A larger version based on `Mixtral-Instruct-8x7B` also exists and is reported to reach a 79.15% correlation with GPT-4 Turbo judgements.

The LLM is specialized in 2 tasks:
- **direct assessment** where a score between 1 and 5 is predicted,
- **pairwise ranking**

#### **Loading the model**

Now, we're going to load the evaluator model.  
To do this we will use **vLLM**, an **easy-to-use tool for accelerated inference**. Generation is much faster than with a model loaded through the `from_pretrained` method.  
**vLLM will be explained in a later part of the course**.

In [None]:
llm = LLM(model=str(DSDIR / "HuggingFace_Models/prometheus-7b-v2.0"), gpu_memory_utilization=0.75)
sampling_params = SamplingParams(max_tokens=1000)

#### **Direct assessment**

Prometheus takes up to 4 components in the input: 
- an instruction (often the question),
- a response to evaluate,
- a score rubric with a description for each score from 1 to 5,
- an *optional* reference answer.

Below is the prompt template proposed in the paper. According to the authors, this is the optimal prompt given how the model was trained.
To fill in the variables, we'll use `PromptTemplate` from `LangChain`.

In [None]:
# Initialize the template
prompt_absolute_grading = PromptTemplate.from_template("""You are a fair judge assistant tasked with providing clear, objective feedback based on specific criteria, ensuring each assessment reflects the absolute standards set for performance.

###Task Description:
An instruction (might include an Input inside it), a response to evaluate, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)\"
4. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
{orig_instruction}

###Response to evaluate:
{orig_response}

###Score Rubrics:
[{orig_criteria}]
Score 1: {orig_score1_description}
Score 2: {orig_score2_description}
Score 3: {orig_score3_description}
Score 4: {orig_score4_description}
Score 5: {orig_score5_description}

###Feedback: 
""")

Once the prompt has been initialized, you can fill in the variables between `{}` simply as in the next cell.

In our case, let's imagine that our criterion is: **is the content understandable by a 5-year-old child?**  
We'll give the model the content to be evaluated and the instruction this content responds to, along with a slightly more detailed description of each score from 1 to 5.

Here is an example of how to fill the template.

In [None]:
prompt = prompt_absolute_grading.format(
    orig_instruction="How is electricity generated?",
    orig_response="Electricity is generated through the movement of electrons. At the atomic level, electrons orbit the nucleus of atoms. In certain materials, such as metals, electrons are loosely bound to their atoms and can move freely. When a potential difference, or voltage, is applied across a conductor (such as a wire), it creates an electric field. This electric field exerts a force on the free electrons, causing them to move in a particular direction.",
    orig_criteria="Child understanding - determine whether the response to the instruction is comprehensible to a 5-year-old child",
    orig_score1_description="The model's response is completely incomprehensible. The response uses complex vocabulary and concepts that a 5-year-old cannot understand. The concepts mentioned require a university level education.",
    orig_score2_description="The model's response is largely incomprehensible. The response has some simpler words, but overall, it is too complex and confusing for a 5-year-old.",
    orig_score3_description="The model's response is partially comprehensible. The response has some understandable parts, but there are still several words and ideas that are too advanced for a 5-year-old.",
    orig_score4_description="The model's response is mostly comprehensible with only a few words or concepts that might be slightly challenging for a 5-year-old.",
    orig_score5_description="The model's response is perfectly comprehensible. The response is completely clear and easy for a 5-year-old to understand, using simple words and concepts appropriate for their age."
)
print(prompt)
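
If you evaluate several criteria, writing five keyword arguments by hand each time becomes tedious. The sketch below builds them from a list; `build_rubric_kwargs` and `mini_template` are hypothetical names introduced for illustration, and plain `str.format` is used here as a stand-in for the template object.

```python
def build_rubric_kwargs(descriptions):
    """Map five rubric descriptions to the placeholder names used by the template."""
    if len(descriptions) != 5:
        raise ValueError("The rubric needs exactly one description per score from 1 to 5.")
    return {f"orig_score{score}_description": text
            for score, text in enumerate(descriptions, start=1)}


# Shortened stand-in for the rubric part of the template, for illustration only
mini_template = "Score 1: {orig_score1_description}\nScore 5: {orig_score5_description}"
rubric = build_rubric_kwargs(["very hard", "hard", "mixed", "mostly easy", "easy"])
print(mini_template.format(**rubric))
```

With the notebook's template, the same dictionary could presumably be unpacked into the `format` call alongside the instruction, response and criterion.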

Now let's generate the answer using our model.

In [None]:
result = llm.generate(prompt, sampling_params, use_tqdm=False)
result[0].outputs[0].text

**The score should normally be relatively low**. Feel free to modify the prompt and the examples if others come to mind!
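
To use the score programmatically, the generated text has to be parsed. The sketch below relies on the output format requested in the prompt (`Feedback: ... [RESULT] score`); `parse_direct_assessment` is a hypothetical helper, and the example string is made up rather than an actual model output.

```python
import re


def parse_direct_assessment(output):
    """Split a judge output into (feedback, score); score is None if no valid result is found."""
    feedback, separator, verdict = output.partition("[RESULT]")
    feedback = feedback.strip()
    if feedback.startswith("Feedback:"):
        feedback = feedback[len("Feedback:"):].strip()
    if not separator:
        return feedback, None
    # Accept only a standalone integer between 1 and 5
    match = re.search(r"\b([1-5])\b", verdict)
    return feedback, int(match.group(1)) if match else None


# Made-up output, for illustration only
example_output = "Feedback: The response uses terms a young child would not know. [RESULT] 2"
feedback, score = parse_direct_assessment(example_output)
print(score, "|", feedback)
```

Returning `None` instead of raising makes it easy to count the generations that do not follow the expected format.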

#### **Pairwise ranking**

Prometheus takes up to 4 components in the input: 
- an instruction (often the question),
- 2 responses to evaluate,
- a score rubric,
- an *optional* reference answer.

Again, we start from the prompt template proposed in the paper and fill it in.

In [None]:
# Initialize the template
prompt_pairwise_ranking = PromptTemplate.from_template("""You are a fair judge assistant assigned to deliver insightful feedback that compares individual performances, highlighting how each stands relative to others within the same cohort.

###Task Description:
An instruction (might include an Input inside it), a response to evaluate, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of two responses strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, choose a better response between Response A and Response B. You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (A or B)"
4. Please do not generate any other opening, closing, and explanations.

###Instruction:
{orig_instruction}

###Response A:
{orig_response_A}

###Response B:
{orig_response_B}

###Score Rubric:
{orig_criteria}

###Feedback: 
""")

We continue with the same example as before, and this time we compare two answers to find out which is more comprehensible to a 5-year-old child.

In [None]:
prompt = prompt_pairwise_ranking.format(
    orig_instruction="How is electricity generated?",
    orig_response_A="Electricity is generated through the movement of electrons. At the atomic level, electrons orbit the nucleus of atoms. In certain materials, such as metals, electrons are loosely bound to their atoms and can move freely. When a potential difference, or voltage, is applied across a conductor (such as a wire), it creates an electric field. This electric field exerts a force on the free electrons, causing them to move in a particular direction.",
    orig_response_B="Electricity can be generated using wind turbines, dams, nuclear power stations or fossil fuels such as oil.",
    orig_criteria="Child understanding - determine whether the response to the instruction is comprehensible to a 5-year-old child",
)
print(prompt)

And finally, the answer from Prometheus 2.

In [None]:
result = llm.generate(prompt, sampling_params, use_tqdm=False)
result[0].outputs[0].text

**The preferred answer should normally be B** because it's easier for a child to understand, although it's not perfect.
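
The verdict can be extracted from the generated text with a few lines of code. This sketch relies on the output format requested in the pairwise prompt (`[RESULT]` followed by `A` or `B`); `parse_pairwise_ranking` is a hypothetical helper and the example string is made up.

```python
import re


def parse_pairwise_ranking(output):
    """Return 'A' or 'B' from a judge output, or None if no verdict is found."""
    _, separator, verdict = output.rpartition("[RESULT]")
    if not separator:
        return None
    # Look for a standalone A or B after the result marker
    match = re.search(r"\b([AB])\b", verdict)
    return match.group(1) if match else None


# Made-up output, for illustration only
example_output = "Feedback: Response B uses everyday words, unlike Response A. [RESULT] B"
print(parse_pairwise_ranking(example_output))
```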