# **02 - LLM Evaluation Part 2**

Before starting this tutorial, **stop the kernels you have run so far**. The following image shows where to go. Then click on `Shut Down All`.  
![image](images/stop_kernels.png)

In [None]:
import os
import random
import json
from pathlib import Path

import datasets
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset, IterableDataset
from torchmetrics.text import Perplexity
from tqdm.notebook import tqdm
from transformers import (AutoModelForCausalLM, AutoModelForSeq2SeqLM,
                          AutoModelForSequenceClassification, AutoTokenizer)
from utils import seed_everything
from jupyterquiz import display_quiz


quiz_path = Path("./quiz/evaluation.json")
quiz = json.loads(quiz_path.read_text())

DSDIR = Path(os.environ["DSDIR"])
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
seed_everything(53)

---

## **Implementing validation loop using the Phi-2 model to compute Perplexity**

In the roleplay example, we don't have a validation dataset, so we will create one. This dataset is not supervised, i.e. it does not give the expected response for a given message. We therefore cannot use the previous metrics, which often require a ground truth. Instead, we will compute **perplexity** during training.
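As a reminder, perplexity is the exponential of the average negative log-likelihood that the model assigns to the true next tokens. The sketch below illustrates this with made-up probabilities (they are not outputs of Phi-2), and the helper name `perplexity_from_probs` is hypothetical.

```python
import math


def perplexity_from_probs(true_token_probs):
    """Perplexity = exp(mean negative log-likelihood of the true tokens)."""
    nll = [-math.log(p) for p in true_token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that always gives probability 1/4 to the correct token is as
# "perplexed" as if it were choosing uniformly among 4 tokens.
uniform = perplexity_from_probs([0.25, 0.25, 0.25, 0.25])

# Higher probabilities on the correct tokens give a lower perplexity.
confident = perplexity_from_probs([0.9, 0.8, 0.95, 0.85])

print(f"uniform over 4 tokens: {uniform:.2f}")
print(f"more confident model:  {confident:.2f}")
```

A lower perplexity means the model puts more probability mass on the tokens that actually appear in the data.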

### **Dataset**
First, let's load our data again.

In [None]:
# Load the dataset
roleplay_dataset = datasets.load_from_disk(DSDIR / "HuggingFace/hieunguyenminh/roleplay", keep_in_memory=True)
roleplay_dataset

For an evaluation we need 3 sets: a **training** dataset, a **validation** dataset and a **final evaluation** dataset.

For this example, we will take **80% of the data for training**. The remaining **20% will be used as the final evaluation set**. We will then take **20% of the training set for validation**.
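Because the second split is applied to the training portion only, the final proportions are not 80/20/20. A quick sketch of the arithmetic (the helper `split_fractions` is hypothetical and only mirrors the two `train_size=0.8` splits described above):

```python
def split_fractions(first_train_size=0.8, second_train_size=0.8):
    """Return the share of the full dataset that ends up in each split."""
    test = 1.0 - first_train_size                          # held out first
    train = first_train_size * second_train_size           # kept for training
    valid = first_train_size * (1.0 - second_train_size)   # carved out of train
    return {"train": train, "valid": valid, "test": test}


fractions = split_fractions()
for name, frac in fractions.items():
    print(f"{name}: {frac:.0%}")
```

With these values, roughly 64% of the data is used for training, 16% for validation and 20% for the final evaluation.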

In [None]:
roleplay_dataset = roleplay_dataset['train'].train_test_split(train_size=0.8)
ds_test = roleplay_dataset.pop("test")
print(roleplay_dataset)
print(ds_test)

In [None]:
# we take 20% of the train for validation
roleplay_dataset = roleplay_dataset["train"].train_test_split(train_size=0.8)
ds_valid = roleplay_dataset.pop("test")
ds_valid

In [None]:
roleplay_dataset["valid"] = ds_valid
roleplay_dataset["test"] = ds_test

roleplay_dataset

### **Loading the Phi-2 model and its tokenizer**

In [None]:
# Initialize the model and its tokenizer
model = AutoModelForCausalLM.from_pretrained(
    DSDIR / "HuggingFace_Models/microsoft/phi-2",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # Allow using code that was not written by HuggingFace
    attn_implementation="flash_attention_2",  # Optimize the model with Flash Attention
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(DSDIR / "HuggingFace_Models/microsoft/phi-2")

We'll take a look at some of the functions we've already written that will be useful.

In [None]:
def generate_text(model, tokenizer, prompt):
    """Generate text from a prompt and print it."""
    model_inp = tokenizer(prompt, return_tensors="pt").to("cuda")
    # generate() runs a succession of auto-regressive forward passes
    out = model.generate(input_ids=model_inp["input_ids"], do_sample=False, max_new_tokens=100)
    print(tokenizer.decode(out[0]))

In [None]:
class RoleplayDataset(IterableDataset):

    def __init__(self, tokenizer, hf_dataset, seq_length=1024, nb_tokens=3160542):
        self.tokenizer = tokenizer
        self.separator = tokenizer.eos_token_id  # The token that will separate different samples
        self.hf_dataset = hf_dataset
        # It will allow us to sample an index of the HF_dataset randomly
        self.idx_iterator = iter(random.sample(range(len(hf_dataset)), len(hf_dataset)))
        self.seq_length = seq_length
        self.nb_tokens = nb_tokens
    
    def __len__(self):
        return self.nb_tokens // self.seq_length

    def get_next_sample(self):
        """Retrieve the next sample from the dataset and tokenize it."""
        idx = next(self.idx_iterator)
        text = self.hf_dataset[idx]["text"]
        return self.tokenizer(text)['input_ids'] + [self.separator]

    def __iter__(self):
        next_sample_ids = None
        all_token_ids = []
        idx = 0

        while idx < self.__len__():
            if next_sample_ids is None:
                next_sample_ids = self.get_next_sample()

            if len(all_token_ids) + len(next_sample_ids) <= self.seq_length:
                # if the next HF_dataset sample can fit in the current dataset sample
                # we add it
                all_token_ids += next_sample_ids
                next_sample_ids = None
                
            else:
                # if the next HF_dataset sample can't fit in the current dataset
                # sample, we add what we can in the dataset sample and then we yield it
                # note: we add one more element compared to seq_length to return to
                # seq_length when generating inputs and targets (see train_collate())
                idx_break = self.seq_length - len(all_token_ids)
                all_token_ids += next_sample_ids[: idx_break + 1]
                next_sample_ids = next_sample_ids[idx_break + 1 :]
                
                model_inp = torch.tensor(all_token_ids[:-1], dtype=torch.int64)
                labels = torch.tensor(all_token_ids[1:], dtype=torch.int64) 
                yield model_inp, labels

                all_token_ids = []
                idx += 1


In [None]:
def count_tokens(hf_dataset, tokenizer):
    total_tokens = 0
    loop = tqdm(hf_dataset)
    # iterate over every element of the dataset
    for element in loop:
        # Count the number of token of one element
        nb_token_element = len(tokenizer(element['text'])["input_ids"])
        # Add the count to the total count
        total_tokens += nb_token_element
        
        loop.set_postfix(tokens_count=total_tokens)
        
    return total_tokens

We now initialize our training dataset.

In [None]:
tokens_train = count_tokens(roleplay_dataset['train'], tokenizer)
seq_length=512
train_dataset = RoleplayDataset(tokenizer, roleplay_dataset['train'], seq_length=seq_length, nb_tokens=tokens_train)
print(len(train_dataset))

We also reuse the dataloader.

In [None]:
train_dataloader = DataLoader(
    train_dataset,
    batch_size=4,
    num_workers=1,
    prefetch_factor=2,
)

In [None]:
print(f"Number of train batches: {len(train_dataloader)}")

### **Validation dataset**

To evaluate our model under realistic conditions, we **need a dataset that does not concatenate the examples**. We will use the same logic as in the training dataset, shifting the tokens by 1 between the input and the target. This will allow us to compute **perplexity** on **unseen data**.
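The shift by 1 can be illustrated on a plain list of token ids. This is only a sketch of one possible convention: the function name `make_input_and_labels` and the token ids are made up, and keeping `seq_length + 1` tokens before shifting is an assumption that mirrors the note in the training dataset.

```python
def make_input_and_labels(token_ids, seq_length):
    """Truncate if needed, then shift by one to build (input, labels)."""
    # Keep one extra token so that both input and labels have at most
    # seq_length tokens after the shift.
    token_ids = token_ids[: seq_length + 1]
    model_inp = token_ids[:-1]  # everything except the last token
    labels = token_ids[1:]      # everything except the first token
    return model_inp, labels


token_ids = [11, 22, 33, 44, 55, 66]  # made-up token ids

# Short enough: nothing is truncated.
short_inp, short_labels = make_input_and_labels(token_ids, seq_length=10)

# Too long: the sequence is truncated before the shift.
trunc_inp, trunc_labels = make_input_and_labels(token_ids, seq_length=3)

print(short_inp, short_labels)
print(trunc_inp, trunc_labels)
```

At every position, the label is the token that follows the corresponding input token, which is exactly what the model is asked to predict.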

This metric is very fast because it does not require any text to be generated. For metrics that need references, we'll look at this in the second tutorial on evaluation. 

Previously, we concatenated all our examples, so every element had the same size. This is no longer the case here, so we need padding.  
**Examples that are too short must be padded**, and examples that are too long must be **truncated**.
In the next cell, **the padding side is set to the left**, i.e. the padding tokens are added at the beginning of the sequence. This is recommended for decoder-only models such as Phi-2.
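To make left padding concrete, here is a minimal sketch on plain lists. The pad id `0`, the token ids and the helper `pad_left` are hypothetical and are not the values used by the Phi-2 tokenizer.

```python
PAD_ID = 0  # hypothetical padding id, chosen only for this illustration


def pad_left(token_ids, target_length, pad_id):
    """Prepend pad_id until the sequence reaches target_length."""
    return [pad_id] * (target_length - len(token_ids)) + token_ids


batch = [[5, 6, 7], [8, 9]]  # two made-up sequences of different lengths
longest = max(len(seq) for seq in batch)
padded = [pad_left(seq, longest, PAD_ID) for seq in batch]

# 1 for real tokens, 0 for padding. Comparing with PAD_ID works here only
# because no real token uses that id in this toy example.
attention_mask = [[int(tok != PAD_ID) for tok in seq] for seq in padded]

print(padded)
print(attention_mask)
```

With left padding, the last position of every row holds a real token, so a decoder that generates from the end of the sequence always continues from actual content rather than from padding.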

In [None]:
tokenizer.padding_side = "left"

In [None]:
# A cell to help you to answer the following question

In [None]:
display_quiz([quiz[0]])

In [None]:
print(f"By default, Phi-2's padding token id is {tokenizer.pad_token_id}")

We define a padding token called `[PAD]`.

In [None]:
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    
print(f"The padding token is {tokenizer.pad_token} and its id is {tokenizer.pad_token_id}")

Now let's create our **validation dataset to compute perplexity**.

<hr style="border:1px solid red"> 

> <span style="color:red">**Task**:</span> Your task is to write the validation dataset class called `ValidRoleplayDataset`. This will inherit from the `Dataset` class in `torch.utils.data`. The constructor will take as parameters a tokenizer, a huggingface dataset and the maximum length of a sequence. Two other methods need to be defined: `__len__(self)` and `__getitem__(self, idx)`.  `__getitem__` must return the text tokenised with input and target (offset by 1).  
**If a sequence is too long (> `seq_length`), truncation is necessary. Padding will then be performed in the batch collate function.**

In [None]:
class ValidRoleplayDataset(Dataset):
############ Complete here ############

#######################################

**Ease level 1:**

**Solution:**

In [None]:
valid_dataset = ValidRoleplayDataset(tokenizer, roleplay_dataset['valid'])

Take a look at a few examples.

In [None]:
# first sample
sample = valid_dataset[0]
print(f"model_inp shape {sample[0].shape}")
print(f"labels shape {sample[1].shape}")

In [None]:
# second sample
sample = valid_dataset[1]
print(f"model_inp shape {sample[0].shape}")
print(f"labels shape {sample[1].shape}")

**The sizes are different**. It is impossible to create batches in this case. We're going to add a function to add padding when creating batches. All the elements in the batch will be the **same size**. The padding is done in such a way as to have tensors equal to the size of the largest. **This can be less than `seq_length`**.

In [None]:
def padding_collate(batch):
    all_model_inp = []
    all_labels = []
    # get max size to add padding not necessarily up to `seq_length`
    longest_tensor = max(len(elem[0]) for elem in batch)
    
    for model_inp, labels in batch:
        nb_padding_to_add = longest_tensor - model_inp.shape[0]
        padding = (nb_padding_to_add, 0)  # (left padding, right padding)
        
        model_inp_pad = torch.nn.functional.pad(model_inp, padding, mode='constant', value=tokenizer.pad_token_id)
        labels_pad = torch.nn.functional.pad(labels, padding, mode='constant', value=tokenizer.pad_token_id)

        all_model_inp.append(model_inp_pad)
        all_labels.append(labels_pad)
    
    all_model_inp = torch.stack(all_model_inp)
    all_labels = torch.stack(all_labels)
    return all_model_inp, all_labels

We then initialize the validation dataloader.

In [None]:
valid_dataloader = DataLoader(
    valid_dataset,
    batch_size=4,
    num_workers=1,
    prefetch_factor=2,
    shuffle=False,  # shuffle is useless during validation step
    collate_fn=padding_collate
)

You can test it.

In [None]:
batch = next(iter(valid_dataloader))
print(f"model_inp shape {batch[0].shape}")
print(f"all_labels shape {batch[1].shape}")

Finally, we reuse the **loss, its preparation function and the training loop unchanged**.

In [None]:
# Initialize Optimizer and Criterion
# We choose the CrossEntropyLoss and Adam because they're the most used
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-6)

In [None]:
def prepare_for_loss(logits, labels):
    """Unfold the Tensors to compute the CrossEntropyLoss correctly"""
    batch_size, seq_length, vocab_size = logits.shape
    logits = logits.view(batch_size * seq_length, vocab_size)
    labels = labels.view(batch_size * seq_length)
    return logits, labels

In [None]:
def train_loop(model, dataloader, criterion, optimizer, test=False):
    model.train()
    # tqdm for a nice progress bar
    loop = tqdm(dataloader, desc="train")

    for i, (model_inp, labels) in enumerate(loop):
        optimizer.zero_grad()

        model_inp = model_inp.to("cuda")
        labels = labels.to("cuda")

        logits = model(model_inp).logits

        logits, labels = prepare_for_loss(logits, labels)
        loss = criterion(logits, labels)

        loss.backward()
        optimizer.step()

        # print next to progress bar
        loop.set_postfix(loss=loss.item())

        if i >= 50 and test:
            loop.close()
            break

    return model

We can redo a small test as before. 

In [None]:
prompt = """<|system|>Orphaned at age three, when he witnessed his mother's brutal murder, Dexter was adopted by Miami police officer Harry Morgan. Recognizing the boy's trauma and the subsequent development of his sociopathic tendencies, Harry trained Dexter to channel his gruesome bloodlust into vigilantism, killing only heinous criminals who slip through the criminal justice system.
<|user|>How do you approach a new case, Dexter?
<|assistant|>"""
generate_text(model, tokenizer, prompt)

### **Validation evaluation loop**

Now we have everything we need to set up a validation loop during the training of our Phi-2 model.

First, we will **generate outputs for a few predetermined prompts** so that we can inspect them visually.  
Below is a list of prompts. You can modify them as you wish.

In [None]:
visu_prompts = [
    """<|system|>Orphaned at age three, when he witnessed his mother's brutal murder, Dexter was adopted by Miami police officer Harry Morgan. Recognizing the boy's trauma and the subsequent development of his sociopathic tendencies, Harry trained Dexter to channel his gruesome bloodlust into vigilantism, killing only heinous criminals who slip through the criminal justice system.
<|user|>How do you approach a new case, Dexter?
<|assistant|>""",
    "<|system|>",
    "<|user|>", 
    "<|assistant|>",
]

The `generate` function below is very similar to `generate_text`. However, it returns only the generated content, without the prompt. The `inputs` parameter corresponds to the tokenized prompts, with their `input_ids` and `attention_mask`.

In [None]:
def generate(model, tokenizer, inputs):
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        do_sample=False,
        max_new_tokens=100,
        pad_token_id=tokenizer.pad_token_id,
    )    
    
    input_tokens_lengths = [x.shape[0] for x in inputs["input_ids"]]
    output_tokens_lengths = [x.shape[0] for x in outputs]

    total_new_tokens = [
        o - i for i, o in zip(input_tokens_lengths, output_tokens_lengths)
    ]
    
    outputs_new = []
    for i, total_new_token in enumerate(total_new_tokens):
        outputs_new.append(
            tokenizer.batch_decode(
                [outputs[i][-total_new_token:]], skip_special_tokens=True
            )[0]
        )
        
    return outputs_new

<hr style="border:1px solid red"> 

> <span style="color:red">**Task**:</span> Your task is to write a function to display the results of the `visu_prompts` prompts on the screen using the `generate` function. The function will be called `generate_visu_prompts(model, tokenizer, prompts)`.

In [None]:
def generate_visu_prompts(model, tokenizer, prompts):


**Ease level 1:**

**Solution**

You can test it.

In [None]:
generate_visu_prompts(model, tokenizer, visu_prompts)

We also need to initialise perplexity. We will load the `Perplexity` metric from the `torchmetrics` module.

In [None]:
display_quiz([quiz[1]])

No, you don't have to compute perplexity on all tokens. The padding token must be excluded from the calculation: knowing that our model is confident when predicting padding tokens is of no interest.
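The effect of ignoring padding can be sketched without any model. In the example below, the probabilities, the labels, the pad id `-1` and the helper `masked_perplexity` are all made up for illustration; this is not how `torchmetrics` is implemented internally.

```python
import math

PAD_ID = -1  # hypothetical padding id, used only in this illustration


def masked_perplexity(true_token_probs, labels, ignore_index=None):
    """exp(mean NLL), skipping positions whose label equals ignore_index."""
    nll = [
        -math.log(p)
        for p, label in zip(true_token_probs, labels)
        if label != ignore_index
    ]
    return math.exp(sum(nll) / len(nll))


# Two left-padding positions followed by three real tokens.
labels = [PAD_ID, PAD_ID, 12, 7, 3]
# Probability given to the true label at each position: the model is
# almost certain about the padding, much less about the real tokens.
probs = [0.99, 0.99, 0.5, 0.25, 0.5]

with_padding = masked_perplexity(probs, labels)
without_padding = masked_perplexity(probs, labels, ignore_index=PAD_ID)

print(f"all tokens:      {with_padding:.3f}")
print(f"padding ignored: {without_padding:.3f}")
```

Counting the easy padding positions pulls the perplexity down and makes the model look better than it really is on actual text.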

<hr style="border:1px solid red"> 

> <span style="color:red">**Task**:</span> Complete the initialization code for the perplexity metric so that it ignores the padding token.

In [None]:
perplexity = Perplexity(ignore_index=### COMPLETE HERE ###).to("cuda")

**Solution:**

Now we're going to create our **evaluation loop during training**.

<hr style="border:1px solid red"> 

> <span style="color:red">**Task**:</span> Your task is to create a function called `eval_loop` which takes as parameters the model, the validation dataloader and the tokenizer. This function, called at each epoch, will display the generations for the visualization prompts and compute the perplexity on the validation set. The function must return the perplexity obtained.  
Regarding how `torchmetrics` metrics work, you can use:
- `perplexity.reset()` to reset the scores recorded so far
- `perplexity(preds, targets)` to update the perplexity. This call returns the current perplexity.
- `perplexity.compute()` to get the global perplexity so far.

In [None]:
def eval_loop(model, valid_dataloader, tokenizer):

**Ease level 1:**

**Solution:**

We can now monitor these indicators during training. It's time to write a train function over several epochs.

<hr style="border:1px solid red"> 

> <span style="color:red">**Task**:</span> Your task is to write a `train` function to run training over several epochs (we'll choose 2 epochs). We will run the evaluation loop before the first epoch and after each epoch to see how the training is progressing.  
Also store the perplexity scores obtained in `perplexity_score` for each epoch.

In [None]:
perplexity_score = []

def train(model, tokenizer, train_dataloader, valid_dataloader, criterion, optimizer, epochs=2, test=True):


**Ease level 1:**

**Solution**

We can now test our model with its validation loop.

In [None]:
train(model, tokenizer, train_dataloader, valid_dataloader, criterion, optimizer, epochs=2, test=True)

In [None]:
perplexity_score

**Perplexity decreases**, so our model becomes increasingly confident as it learns.

We can now verify that **perplexity is equal to $e^{\text{loss}}$** on a batch.

In [None]:
# No gradients are needed for this check, so we disable them to save memory
with torch.no_grad():
    for model_inp, labels in valid_dataloader:
        model_inp = model_inp.to("cuda")
        labels = labels.to("cuda")
        print(f"labels tensor shape: {labels.shape}")
        logits = model(model_inp).logits
        print(f"logits tensor shape: {logits.shape}")
        break  # a single batch is enough

In [None]:
logits_reshape, labels_reshape = prepare_for_loss(logits, labels)

print(f"{logits_reshape.shape=}")
print(f"{labels_reshape.shape=}")

In [None]:
loss = criterion(logits_reshape, labels_reshape)
print(f"Exponential of the loss: {torch.exp(loss)}")

In [None]:
display_quiz([quiz[4]])

In [None]:
print(f"Perplexity with torchmetrics and padding tokens ignored: {perplexity(logits, labels).item()}")

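The two values differ because `criterion` averages the loss over all tokens, padding included, whereas `perplexity` ignores the padding token. To obtain the same value, the metric must take all the tokens into account. Let's give it a try.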

In [None]:
perplexity_basic = Perplexity().to("cuda")
print(f"Perplexity with torchmetrics and all tokens: {perplexity_basic(logits, labels).item()}")

---
We could also have added other metrics during validation, such as BERTScore.  
However, to use these metrics, we need to have a **prompt and the reference associated with this prompt as input**. We need a **"supervised" dataset**.

For example, for the following sample:
```
<|system|>King Kong is a colossal ape, a tragic figure of immense strength, wrested from his remote island home and put on display in the bustling metropolis of New York City.</s> <|user|>What is King Kong's backstory?</s> <|assistant|>I was the reigning monarch of Skull Island, a remote and mysterious land where time stood still. My world was turned upside down when I was captured and transported to New York City, where I became a spectacle for the masses.</s> <|user|>How did you feel about being taken from your home to New York City?</s> <|assistant|>I was enraged and bewildered by the sudden upheaval. The concrete jungle was a stark contrast to the lush, untamed wilderness of my island home. I longed for the familiar sights and sounds of Skull Island.</s> <|user|>Do you have any allies or friends in New York City?</s> <|assistant|>I formed a unique connection with a woman named Ann Darrow, who showed me kindness and compassion amidst the chaos of the city. She was the only one who saw beyond my fearsome exterior to the gentle soul within.</s> <|user|>How do you feel about the humans who captured and displayed you?</s> <|assistant|>I harbored a deep-seated mistrust and resentment towards the humans who exploited me for their own gain. I was a creature of untold power, yet they sought to subdue and profit from me without understanding the depth of my being.</s>
```

The input could be:
```
<|system|>King Kong is a colossal ape, a tragic figure of immense strength, wrested from his remote island home and put on display in the bustling metropolis of New York City.</s> <|user|>What is King Kong's backstory?</s> <|assistant|>
```

And the expected output to the previous prompt:
```
I was the reigning monarch of Skull Island, a remote and mysterious land where time stood still. My world was turned upside down when I was captured and transported to New York City, where I became a spectacle for the masses.</s> 
```

This would allow us to use metrics that require ground truth. This means writing a new dataset for this evaluation.  
**We'll see about that on the last day where we'll evaluate the models you've trained on the use case.**
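As a rough preview, the split into input and expected output shown above could be sketched as follows. This is only an illustration under stated assumptions: the helper `split_prompt_and_reference` is hypothetical, the sample is a shortened version of the King Kong example, and only the first assistant turn is extracted.

```python
ASSISTANT_TAG = "<|assistant|>"
END_TAG = "</s>"


def split_prompt_and_reference(sample_text):
    """Split a conversation at its first assistant turn.

    Returns (prompt, reference): the prompt ends with the assistant tag,
    and the reference is the first assistant answer, end tag included.
    """
    prompt_end = sample_text.index(ASSISTANT_TAG) + len(ASSISTANT_TAG)
    prompt = sample_text[:prompt_end]
    rest = sample_text[prompt_end:]
    end = rest.find(END_TAG)
    # If no end tag is found, keep the remaining text as the reference.
    reference = rest if end == -1 else rest[: end + len(END_TAG)]
    return prompt, reference


# Shortened version of the sample above, for illustration only
sample = (
    "<|system|>King Kong is a colossal ape.</s> "
    "<|user|>What is King Kong's backstory?</s> "
    "<|assistant|>I was the reigning monarch of Skull Island.</s> "
    "<|user|>Do you have any allies or friends in New York City?</s> "
    "<|assistant|>I formed a unique connection with a woman named Ann Darrow.</s>"
)

prompt, reference = split_prompt_and_reference(sample)
print(prompt)
print(reference)
```

The model would generate from `prompt`, and its output would be compared with `reference` using a metric that requires ground truth.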