# **07 - LLM Evaluation**

In this hands-on work, we are going to **evaluate a finetuned model for assistance**.  
The foundation model is **phi-2** and the **dataset used for finetuning is Bitext** (https://huggingface.co/datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset). The model has been trained over **3 epochs in the same way as you did on the first day as part of the `roleplay`**.

To perform this evaluation, we will need to **create an evaluation dataset**. The dataset will be of the "supervised" type, with a **prompt as input and the expected response as output**.  
We'll see how this can be constructed from our data. Then we'll choose a **few metrics** and **write our evaluation loop**. 

**Note:** to simplify this evaluation, we will not be using an LLM-assisted metric for this practical work. These metrics could of course be used here.

**Uncomment the following cell for Jean-Zay only** (no internet access)

In [None]:
# import os

# cache_path = os.environ['WORK'] + "/cache_spellm"
# os.environ['TRANSFORMERS_CACHE'] = cache_path
# os.environ['HF_HOME'] = cache_path
# os.environ['HF_DATASETS_CACHE'] = cache_path
# os.environ['TORCH_HOME'] = cache_path

In [None]:
import os
import random
import json
import pandas as pd
from pathlib import Path

import datasets
import torch
from bert_score import BERTScorer
from detoxify import Detoxify
from torch.utils.data import DataLoader, Dataset, IterableDataset
from torchmetrics import Metric
from tqdm.notebook import tqdm
from torchmetrics.text import CHRFScore, BERTScore
from utils import seed_everything, write_results_to_file
from vllm import LLM, SamplingParams
from langchain.prompts import PromptTemplate
from jupyterquiz import display_quiz


quiz_path = Path("./quiz/evaluation.json")
quiz = json.loads(quiz_path.read_text())

DSDIR = Path(os.environ["DSDIR"])
DATASET_PATH = DSDIR / "HuggingFace/bitext/Bitext-customer-support-llm-chatbot-training-dataset"
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
seed_everything(53)

**Path of the evaluated model:**

In [None]:
MODEL_PATH = Path(DSDIR / "data_spellm/finetuned-phi-2")  # finetuned

---

## **Discover the dataset**

The `Bitext` dataset is a question and answer dataset for developing virtual assistants.  
Each element is composed, in particular, of:
- `instruction`: a user request from the Customer Service domain,
- `category`: the type of request (for example: ORDER, REFUND...),
- `intent`: a more precise sub-category (for example: cancel_order, change_order or place_order for ORDER),
- `response`: an example expected response from the virtual assistant.

In [None]:
dataset = datasets.load_dataset(str(DATASET_PATH))['train'].select(range(20000))  # We select the first 20,000 elements of the dataset.
dataset

We split the dataset into a train set and a test set.

In [None]:
dataset = dataset.train_test_split(train_size=0.95)

In [None]:
# Use this cell to answer the following question

In [None]:
display_quiz([quiz[2]])

In [None]:
test_dataset = dataset["test"]

print(f"The dataset contains {len(test_dataset)} samples.")

It is important to note that **the same split (same seed) was used for training, so our model never saw our test data.** Note that our `train_test_split` could be improved: for example, the evaluation set may contain categories that never appear in the training set.
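
One way to check this concern is to compare the labels present in each split. The sketch below uses small hand-written lists as stand-ins for `dataset["train"]["category"]` and `dataset["test"]["category"]`; the values are illustrative and `unseen_labels` is a hypothetical helper, not part of the course utilities.

```python
def unseen_labels(train_labels, test_labels):
    """Return the labels that appear in the test split but never in the train split."""
    return sorted(set(test_labels) - set(train_labels))


# Illustrative stand-ins for dataset["train"]["category"] and dataset["test"]["category"]
train_categories = ["ORDER", "REFUND", "ORDER", "ACCOUNT"]
test_categories = ["ORDER", "INVOICE", "REFUND"]

print(unseen_labels(train_categories, test_categories))
```

An empty list would mean that every test category was also seen during training. The same check can be applied to the `intent` column.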

Take a look at some examples:

In [None]:
test_dataset[500]

---

## **Creation of the evaluation dataset**

Creating the evaluation set is a **very important step**. This evaluation set **must be similar to the scenarios the model will encounter once it is in production**.

During training, the examples were concatenated with a certain template, as follows:
```
# Category: {category}
# Intent: {intent}
<user_request>: {instruction}
<system_response>: {response}
```

For our evaluation, **we choose to assess the assistant's responses by comparing them with the given reference**.
The input will therefore take the following form:
```
# Category: {category}
# Intent: {intent}
<user_request>: {instruction}
```

For the output, we will expect the LLM to have `<system_response>:` as well as a response whose meaning is similar to the given response. The expected output will therefore be:
```
<system_response>: {response}
```

We're going to **create this PyTorch dataset**.

To fill in the template variables, we are going to use `PromptTemplate` from `LangChain`. Here's how to use it:

In [None]:
# Initialize the template
test_prompt_template = PromptTemplate.from_template("""# Category: {category}
# Intent: {intent}
<user_request>: {instruction}"""
)

In [None]:
# Fill the template with `format` function
print(
    test_prompt_template.format(
        category="the category", 
        intent="the intent", 
        instruction="the instruction")
)

<hr style="border:1px solid red"> 

> <span style="color:red">**Task**:</span> Your task is to complete the code for the `BitextDataset` class. It will return the input shown above and the expected answer. Do not apply tokenization: our model takes strings directly as input, and the metrics also need strings.

In [None]:
class BitextDataset(Dataset):
    def __init__(self, hf_dataset, prompt_template):
        self.hf_dataset=hf_dataset
        self.prompt_template = prompt_template

    def __len__(self):
        ############ Complete here ############
        return 
        #######################################
        
    def __getitem__(self, idx):
        """Return the input (prompt) for the model and the associated expected answer"""
        ############ Complete here ############
        element = 

        inp = 
        target=
        #######################################

        return inp, target

**Solution:**
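
One possible way to fill in the class is sketched below. To keep the example self-contained, it uses a plain list of dictionaries in place of the Hugging Face dataset and a small hypothetical `SimpleTemplate` class in place of LangChain's `PromptTemplate` (only `.format(...)` is relied on). In the notebook, the class would subclass `torch.utils.data.Dataset` instead of being a plain class. The record used at the end is made up.

```python
class SimpleTemplate:
    """Hypothetical stand-in for PromptTemplate: it only exposes `format`."""

    def __init__(self, template):
        self.template = template

    def format(self, **kwargs):
        return self.template.format(**kwargs)


class BitextDatasetSketch:
    """Map-style dataset returning (prompt, expected answer) pairs as raw strings."""

    def __init__(self, hf_dataset, prompt_template):
        self.hf_dataset = hf_dataset
        self.prompt_template = prompt_template

    def __len__(self):
        return len(self.hf_dataset)

    def __getitem__(self, idx):
        element = self.hf_dataset[idx]
        # Build the prompt from the three input fields
        inp = self.prompt_template.format(
            category=element["category"],
            intent=element["intent"],
            instruction=element["instruction"],
        )
        # The expected output starts with the response marker
        target = f"<system_response>: {element['response']}"
        return inp, target


sketch_template = SimpleTemplate(
    "# Category: {category}\n# Intent: {intent}\n<user_request>: {instruction}"
)
# Made-up record with the same keys as a Bitext row
sketch_rows = [{
    "category": "ORDER",
    "intent": "cancel_order",
    "instruction": "I want to cancel my order",
    "response": "Sure, I can help.",
}]
sketch_dataset = BitextDatasetSketch(sketch_rows, sketch_template)
sketch_inp, sketch_target = sketch_dataset[0]
print(sketch_inp)
print(sketch_target)
```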

Our evaluation dataset has now been created. We can initialize it with the following code.

In [None]:
dataset = BitextDataset(
    hf_dataset=test_dataset, 
    prompt_template=test_prompt_template)

View a few examples manually and think about possible areas for improvement for this dataset.

In [None]:
inp, target = dataset[20]

In [None]:
print(inp)

In [None]:
print(target)

After looking at a few examples, you can see that some instructions and responses **contain placeholders (`{{ }}`)**. We might wonder how our model would react if these placeholders were replaced by values. For this tutorial, we won't go any further on this subject.

### **Dataloader**

Now let's create a dataloader for batch evaluation. We'll give our LLM batches of prompts. We will then compare the responses generated with those expected.

<hr style="border:1px solid red"> 

> <span style="color:red">**Task**:</span> Your task is to write the function that formats the batches correctly. This is called `collate`. It receives a list of elements as input. There are `batch_size` elements. These elements contain a prompt and the expected response. As output, we want to have **two lists of strings**: the first containing the prompts and the second containing the expected responses.

In [None]:
def collate(batch):

**Solution:**
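
A minimal sketch of such a function, assuming each element of the batch is a `(prompt, target)` pair of strings as returned by the dataset:

```python
def collate(batch):
    """Turn [(prompt, target), ...] into ([prompt, ...], [target, ...])."""
    prompts = [prompt for prompt, _ in batch]
    targets = [target for _, target in batch]
    return prompts, targets


example_batch = [("prompt 1", "target 1"), ("prompt 2", "target 2")]
print(collate(example_batch))
```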

Initialize the dataloader.

In [None]:
display_quiz([quiz[3]])

In [None]:
eval_dataloader = DataLoader(
    dataset,
    batch_size=8,  # Warning: If you have a memory error during evaluation, lower batch_size to 4.
    num_workers=1,
    prefetch_factor=2,
    collate_fn=collate,
    shuffle=False
)

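Note: for a dataset that returns pairs of strings, the default DataLoader `collate_fn` does essentially what our `collate` function does (it groups the prompts together and the expected responses together). We could have done without it here.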

You can take a look at a few batches.

In [None]:
b0 = next(iter(eval_dataloader))

In [None]:
b0[0]

## **Loading the model**

Now, we're going to load the trained model.  
To do this we will use **vLLM**, an **easy-to-use tool for accelerated inference**. Generation with vLLM is much faster than with a model loaded through the `from_pretrained` method. 
**vLLM will be explained in the next part of the course**.

Even so, text generation is **time-consuming**. To limit this impact during the practical work, **we are limiting the number of output tokens to 100**.

In [None]:
llm = LLM(model=str(MODEL_PATH), gpu_memory_utilization=0.75, seed=53)
sampling_params = SamplingParams(max_tokens=100)

You can generate a few answers based on examples of your choice. Just change the value of the `idx` variable. The prompt and the generated text will be displayed.

In [None]:
# Change idx to test on multiple samples
idx = 1

prompt = dataset[idx][0]
print(f"{'*'*50} Prompt {'*'*50}\n{prompt}\n\n")

result = llm.generate(prompt, sampling_params, use_tqdm=False)

print(f"{'*'*50} Answer generated {'*'*50}\n{result[0].outputs[0].text}\n\n")

print(f"{'*'*50} Reference answer {'*'*50}\n{dataset[idx][1]}")

**Note**: you'll sometimes see a new `<user_request>` in the LLM-generated text, after the generated response. This is expected given its training. We could stop the generation as soon as the LLM produces the `<user_request>` marker again. We won't do this in this tutorial.
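
For illustration only, the same effect can be approximated after generation by cutting the text at the marker. `truncate_at_marker` is a hypothetical helper and the generated string is made up. vLLM's `SamplingParams` also accepts a `stop` list of strings, which should achieve this at generation time.

```python
def truncate_at_marker(text, marker="<user_request>"):
    """Keep only the part of `text` that precedes the first occurrence of `marker`."""
    return text.split(marker, 1)[0].rstrip()


generated = "<system_response>: I can help with that.\n<user_request>: another question"
print(truncate_at_marker(generated))
```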

## **Evaluation loop**

Now we can create our evaluation loop.  

We will **generate the responses batch by batch and compute the metrics as we go along**. Each metric keeps an internal state that gives the overall score.
**Another approach is to generate all the responses first and then compute the metrics**. In our case this could work, but it can **require more memory**, since all the generations must be kept until they are scored.

The structure of the loop will therefore look like this:
```python
loop = tqdm(eval_dataloader)

for i, (prompts, targets) in enumerate(loop):
    # generation of responses from prompts

    # computation of metrics from generated responses and expected responses if needed
    # obtaining scores for the current batch from metrics

# obtaining overall scores from metrics
```


For the evaluation, we will use the following metrics:
- **chrF**,
- **BERTScore**,
- **Detoxify** to assess the toxicity of the responses generated. We want to ensure that our model does not respond to users with inappropriate language.

The `torchmetrics` metrics work in batch mode by default. We use the torchmetrics implementation of chrF, so nothing more is needed for it.

BERTScore is also implemented in torchmetrics, but that implementation seems unreliable (the traditional metrics in torchmetrics are reliable). We will use the official version and wrap it in a torchmetrics `Metric` object. We'll do the same for `Detoxify`.

As you saw on the first day, this is how BERTScore works. We pass it a list of candidates and references and the metric computes a score for precision, recall and F1.

In [None]:
bert_scorer = BERTScorer(model_type="microsoft/deberta-large-mnli")  # deberta-large-mnli pearson correlation: 0.7736

bert_scorer.score(["a candidate", "another candidate"], ["a good reference", "another reference"])

This will work for one batch, but when we receive a new batch, the previous results will be lost. So we're going to implement some logic to update the metric over time.

A `Metric` from torchmetrics has 2 functions to complete:
- `update` which computes the score for the current batch and maintains the overall score of the metric,
- `compute` which computes the final score for the metric.

<hr style="border:1px solid red"> 

> <span style="color:red">**Task**:</span> You will need to complete the code for the `BERTScoreTorchMetrics` class. You will need to complete the `update` and `compute` functions. An average of the precision, recall and F1 results must be returned. To do this, you will have access to 4 attributes: `precision`, `recall`, `f1` and `total` (the number of candidates processed).
In the constructor, you can see lines like this:
```python
self.add_state("precision", default=torch.tensor(0.0), dist_reduce_fx="sum")
```
`precision` is an attribute of the class accessible via `self.precision`. The variable is initialized with a tensor having a value of 0.0. `dist_reduce_fx` is useful in distributed environments. Imagine you have batches on different GPUs and you need to group the results. Here `sum` means that with 2 GPUs, for example, the sum of the `self.precision` attributes will be performed. Torchmetrics makes it easy to create metrics in distributed environments.
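
The small example below illustrates why the states store sums rather than averages. It simulates two GPUs with plain dictionaries and made-up scores; no torchmetrics call is involved.

```python
# Per-sample scores seen by two hypothetical GPUs (made-up numbers)
scores_gpu0 = [0.5, 0.75]
scores_gpu1 = [0.25]

# Each process keeps a running sum and a count, like the `precision` and `total` states
state_gpu0 = {"precision": sum(scores_gpu0), "total": len(scores_gpu0)}
state_gpu1 = {"precision": sum(scores_gpu1), "total": len(scores_gpu1)}

# A "sum" reduction adds the states key by key across processes
reduced = {key: state_gpu0[key] + state_gpu1[key] for key in state_gpu0}

# Dividing the reduced sum by the reduced count gives the mean over all samples
global_mean = reduced["precision"] / reduced["total"]
print(global_mean)
```

Averaging the two per-GPU means instead would give a different value here, because the GPUs did not process the same number of samples.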

In [None]:
class BERTScoreTorchMetrics(Metric):
    full_state_update=False  # optimization: the metric state of one batch is completely independent of the state of other batches -> faster to compute
    def __init__(self, model_type):
        super().__init__()

        self.scorer = BERTScorer(model_type=model_type)
        
        self.add_state("precision", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("recall", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("f1", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("total", default=torch.tensor(0), dist_reduce_fx="sum")
        
    
    def update(self, candidates, references):
        ############ Complete here ############

        
        
        
        
        #######################################
   

    def compute(self):
        ############ Complete here ############
        return {
            "precision":
            "recall":
            "f1":
        }
        #######################################


**Solution:**
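
The accumulation logic can be sketched in plain Python as follows. This is not the torchmetrics class itself: `RunningBERTScoreSketch` and `fake_scorer` are hypothetical names, and the scorer returns constant values instead of calling BERTScore. In the real class, `update` would call `self.scorer.score(candidates, references)`, which returns three tensors (precision, recall, F1), and add their sums to the states.

```python
class RunningBERTScoreSketch:
    """Plain-Python analogue: accumulate sums in `update`, average in `compute`."""

    def __init__(self, scorer):
        self.scorer = scorer  # callable returning per-sample (precision, recall, f1)
        self.precision = 0.0
        self.recall = 0.0
        self.f1 = 0.0
        self.total = 0

    def update(self, candidates, references):
        precision, recall, f1 = self.scorer(candidates, references)
        self.precision += sum(precision)
        self.recall += sum(recall)
        self.f1 += sum(f1)
        self.total += len(candidates)

    def compute(self):
        return {
            "precision": self.precision / self.total,
            "recall": self.recall / self.total,
            "f1": self.f1 / self.total,
        }


def fake_scorer(candidates, references):
    """Hypothetical scorer returning constant per-sample scores, for illustration only."""
    n = len(candidates)
    return [0.5] * n, [0.25] * n, [0.75] * n


sketch_metric = RunningBERTScoreSketch(fake_scorer)
sketch_metric.update(["a candidate", "another candidate"], ["a good reference", "another reference"])
sketch_metric.update(["new candidate"], ["new reference"])
print(sketch_metric.compute())
```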

You can now try the metric.

In [None]:
bertscore_metric = BERTScoreTorchMetrics(model_type="microsoft/deberta-large-mnli").to("cuda")

# bertscore_metric(candidates, references) calls `update` internally
score_1 = bertscore_metric(["a candidate", "another candidate"], ["a good reference", "another reference"])
score_2 = bertscore_metric(["new candidate"], ["new reference"])

print(f"{score_1=}")
print(f"{score_2=}")
print(f"{bertscore_metric.compute()=}")

**Recall can be biased in our case**. It measures whether all the information in the reference can be found in the candidate answer. However, as we've limited the generated outputs to 100 tokens, there's a good chance that our answers will be incomplete. Nevertheless, a good recall score is encouraging, given how short our answers are.  
**Precision** checks whether the information in the candidate can be found in the reference. The **F1 score, which is a compromise between the two**, is also interesting.
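
Since F1 is the harmonic mean of precision and recall for each candidate/reference pair, and our metric averages the per-sample values, the average F1 can end up lower than both the average precision and the average recall. The made-up numbers below illustrate this.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Made-up per-sample (precision, recall) pairs
samples = [(0.8, 0.2), (0.2, 0.8)]

mean_precision = sum(p for p, _ in samples) / len(samples)
mean_recall = sum(r for _, r in samples) / len(samples)
mean_of_f1 = sum(f1(p, r) for p, r in samples) / len(samples)

print(mean_precision, mean_recall, mean_of_f1)
```

This is why, in the `phi-2` results reported at the end of this notebook, an F1 slightly below both precision and recall is not a contradiction.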

We'll do the same for **toxicity** with `Detoxify`. We'll give you the implementation. The idea is similar.

As a reminder, `detoxify` takes as input only a list of strings and returns several scores: 
- `toxicity`, `severe_toxicity`, `identity_attack`, `obscene`, `insult`, `threat`, `sexual_explicit`.

The implementation will return the **average score for each of these indicators**, as well as the **maximum score**. We don't want our LLM to ever make inappropriate comments.
In addition, **if an output exceeds a threshold (configurable), we'll display it**.
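
To see why the maximum is reported alongside the average, consider the made-up scores below: a single problematic answer barely moves the average, but it stands out in the maximum and in the threshold check.

```python
# Made-up toxicity scores for a batch of four generated answers
toxicity_scores = [0.01, 0.02, 0.01, 0.96]
threshold = 0.5

average = sum(toxicity_scores) / len(toxicity_scores)
maximum = max(toxicity_scores)
# Indices of the answers that would be displayed because they exceed the threshold
flagged = [i for i, score in enumerate(toxicity_scores) if score > threshold]

print(f"average={average:.2f} max={maximum:.2f} flagged indices={flagged}")
```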

In [None]:
class DetoxifyTorchMetrics(Metric):
    full_state_update=False  # optimization: the metric state of one batch is completely independent of the state of other batches -> faster to compute
    def __init__(self, threshold: float = 0.5):
        super().__init__()
        
        self.detoxify = Detoxify(
            model_type='multilingual',
            device="cuda"
        )
        self.metrics = ["toxicity", "severe_toxicity", "identity_attack", "obscene", "insult", "threat", "sexual_explicit"]
        self.threshold = threshold

        self.add_state("total", default=torch.tensor(0), dist_reduce_fx="sum")
        
        for metric in self.metrics:
            self.add_state(metric, default=torch.tensor(0.0), dist_reduce_fx="sum")
            self.add_state("max_" + metric, default=torch.tensor(0.0), dist_reduce_fx="max")
        
        
    def update(self, output: list[str], batch_idx: int = 0):
        results = self.detoxify.predict(output)
        batch_size = len(output)
        
        for metric in self.metrics:
            setattr(self, metric, getattr(self, metric) + sum(results[metric]))
            setattr(self, "max_" + metric, torch.max(getattr(self, "max_" + metric), torch.tensor(max(results[metric]))))
            
            for local_idx, score in enumerate(results[metric]):
                if score > self.threshold:
                    print(f"{'#'*10} \033[1m{metric}\033[0m threshold exceeded | score : {round(score, 2)} {'#'*10}\n {output[local_idx]}")
        
        self.total += len(output)

        
    def compute(self):
        return {metric: getattr(self, metric) / self.total for metric in self.metrics}, {"max_" + metric: getattr(self, "max_" + metric) for metric in self.metrics}

Let's give it a quick try.

In [None]:
detoxify_metric = DetoxifyTorchMetrics(threshold=0.5).to("cuda")

score_1 = detoxify_metric(["bonjour comment ca va ?", "shut up"])
score_2 = detoxify_metric(["nous avons bien reçu votre demande"])

global_score = detoxify_metric.compute()
global_score

Everything is almost ready. All that remains is to **write the evaluation loop**.

<hr style="border:1px solid red"> 

> <span style="color:red">**Task**:</span> Your task is to complete the evaluation loop function. Its signature is: `def eval_loop(model, dataloader, chrf_metric, bertscore_metric, detoxify_metric, sampling_params, test=True)`.

In [None]:
def eval_loop(model, dataloader, chrf_metric, bertscore_metric, detoxify_metric, sampling_params, test=True):
    # reset metrics to initial state
    chrf_metric.reset()
    bertscore_metric.reset()
    detoxify_metric.reset()

    loop = tqdm(dataloader, desc="eval_dataloader")

    for i, (prompts, targets) in enumerate(loop):
        ############ Complete here ############
        outputs = 
        outputs = 
        
        score_chrf = 
        score_bertscore = 
        score_detoxify = 

        loop.set_postfix(
            average_chrf=,
            chrf=,
            average_bertscore=,
            bertscore=
        )
        #######################################

        if i >= 50 and test:
            loop.close()
            break
    
    ############ Complete here ############
    # Toxicity is only displayed at the end of the evaluation, as it is too verbose to display for every batch.
    detoxify_score = 
    print(f"Average detoxify: {}")
    print(f"Max detoxify: {}")
    #######################################

**Solution:**
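
One possible shape for the loop is sketched below. The progress bar is left out, and `FakeModel` and `CountingMetric` are hypothetical stand-ins so that the example runs without a GPU. The sketch assumes that the model exposes the `generate(...)` call used earlier in this notebook and that each metric object can be called on a batch and provides `reset()` and `compute()`.

```python
from types import SimpleNamespace


def eval_loop_sketch(model, dataloader, chrf_metric, bertscore_metric, detoxify_metric,
                     sampling_params, test=True):
    # Reset metrics to their initial state
    for metric in (chrf_metric, bertscore_metric, detoxify_metric):
        metric.reset()

    for i, (prompts, targets) in enumerate(dataloader):
        # Generate one completion per prompt and keep only the text
        outputs = model.generate(prompts, sampling_params, use_tqdm=False)
        outputs = [out.outputs[0].text for out in outputs]

        # Calling a metric updates its global state and returns the batch score
        chrf_metric(outputs, [[target] for target in targets])  # one reference per candidate
        bertscore_metric(outputs, targets)
        detoxify_metric(outputs)

        if i >= 50 and test:
            break

    # Global scores over everything seen during the loop
    return {
        "chrf": chrf_metric.compute(),
        "bertscore": bertscore_metric.compute(),
        "detoxify": detoxify_metric.compute(),
    }


class FakeModel:
    """Hypothetical stand-in for the vLLM engine: echoes each prompt in upper case."""

    def generate(self, prompts, sampling_params, use_tqdm=False):
        return [SimpleNamespace(outputs=[SimpleNamespace(text=p.upper())]) for p in prompts]


class CountingMetric:
    """Hypothetical stand-in for a metric: it only counts the samples it has seen."""

    def __init__(self):
        self.seen = 0

    def reset(self):
        self.seen = 0

    def __call__(self, outputs, *args):
        self.seen += len(outputs)
        return len(outputs)

    def compute(self):
        return self.seen


fake_batches = [(["a", "b"], ["x", "y"]), (["c"], ["z"])]
sketch_results = eval_loop_sketch(
    FakeModel(), fake_batches, CountingMetric(), CountingMetric(), CountingMetric(),
    sampling_params=None,
)
print(sketch_results)
```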

Now that our loop has been created, all we have to do is initialize our metrics and put them on the GPU.

In [None]:
chrf_metric = CHRFScore().to("cuda")
bertscore_metric = BERTScoreTorchMetrics(model_type="microsoft/deberta-large-mnli").to("cuda")
detoxify_metric = DetoxifyTorchMetrics(threshold=0.5).to("cuda")

Let's start the evaluation. The results will be displayed in the terminal. In practice, we can log all our results on a platform such as `MLflow` or `Weights & Biases`.

In [None]:
eval_loop(
    model=llm, 
    dataloader=eval_dataloader, 
    chrf_metric=chrf_metric,
    bertscore_metric=bertscore_metric,
    detoxify_metric=detoxify_metric,
    sampling_params=sampling_params,
    test=True,
)

Well, we've got results for the model we trained. Now let's **compare them with the foundation model**.

The `phi-2` foundation model cannot be evaluated in this notebook now, because almost all the GPU memory is already occupied by the first vLLM instance and it is complicated to free it.
We could re-run this exact notebook after changing the model path to the `phi-2` foundation model, but **to save time we've provided the results obtained with the foundation model below.**

In [None]:
print(f"Max memory allocated: {torch.cuda.max_memory_allocated(device='cuda') / (1024 ** 3)} GB")

**Results `phi-2`:**
```
CHRFScore: 0.1326485127210617

BERTScoreTorchMetrics:
{
    'precision': 0.4066,
    'recall':0.4033,
    'f1': 0.4010
}

DetoxifyTorchMetrics: (
    {
        'toxicity': 0.0080,
        'severe_toxicity': 0.0006,
        'identity_attack': 0.0003,
        'obscene': 0.0063,
        'insult': 0.0034,
        'threat': 0.0001,
        'sexual_explicit': 0.0016
    },
    {
        'max_toxicity': 0.9953,
        'max_severe_toxicity': 0.2655,
        'max_identity_attack': 0.0319,
        'max_obscene': 0.9910,
        'max_insult': 0.9479,
        'max_threat': 0.0252,
        'max_sexual_explicit': 0.7255
    }
)
```

**Note that to use LLM-assisted metrics**, we would have had to generate all our predictions, save them and then use a second notebook with another LLM as the evaluator. Two LLMs loaded with vLLM do not fit in a single kernel here (nor in practically any setup). With an evaluator LLM accessed through an API, this could have worked.
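
As a rough sketch of the "save the predictions" step, the generations could be written to a JSON Lines file and reloaded in the second notebook. The helper names, the file name and the record fields below are assumptions made for this example, and the data is made up.

```python
import json
import tempfile
from pathlib import Path


def save_predictions(path, prompts, predictions, references):
    """Write one JSON object per line so that another notebook can reload the generations."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt, prediction, reference in zip(prompts, predictions, references):
            record = {"prompt": prompt, "prediction": prediction, "reference": reference}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_predictions(path):
    """Read the records back as a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


# Made-up example written to a temporary directory
output_path = Path(tempfile.mkdtemp()) / "predictions.jsonl"
save_predictions(output_path, ["a prompt"], ["a generated answer"], ["a reference answer"])
print(load_predictions(output_path))
```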