In [None]:
import numpy as np
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import evaluate
from datasets import load_dataset
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

### Some basics of LLM metrics 


The evaluation of language models involves various metrics, each of which provides different insights into the model's performance. Here we describe some of the common metrics used in evaluating language models.

#### Perplexity (PPL)

- **Mathematical Definition**: $PPL(W) = P(w_1, w_2, \ldots, w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1, w_2, \dots, w_N)}}$, where $W$ is the sequence of words and $N$ is the number of words.
- **Range**: $[1, \infty)$.
- **Use Cases**: Commonly used to evaluate language models, especially in the context of text generation and next-word prediction.
- **Pros**:
  - Intuitive interpretation as the weighted average branching factor of the language.
  - Lower perplexity indicates a better model (less surprised by the test data).
- **Cons**:
  - Highly sensitive to data sparsity and may not always correlate with human judgments of text quality.
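
As a minimal sketch of the definition above, the joint probability can be factored with the chain rule into per-token conditional probabilities, so perplexity is the exponential of the average negative log-probability. The function name `perplexity` and the probability values are illustrative assumptions, not output from a real model.

```python
import math

def perplexity(token_probs):
    """Perplexity from the model's probability for each token in the sequence."""
    n = len(token_probs)
    # Sum log-probabilities rather than multiplying probabilities to avoid underflow
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / n)

# Hypothetical per-token probabilities (not taken from any real model)
uniform = perplexity([0.25, 0.25, 0.25, 0.25])   # model is choosing among 4 equally likely words
confident = perplexity([0.9, 0.8, 0.95, 0.85])   # model assigns high probability to each word

print(uniform, confident)
```

A model that spreads its probability evenly over four candidates at every step has a perplexity of 4, which is the "branching factor" reading mentioned in the pros.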

#### BLEU (Bilingual Evaluation Understudy)

- **Mathematical Definition**:  

  $$BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

  where:
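  - $BP$ is the brevity penalty, which penalizes candidates shorter than the reference: $BP = 1$ if $c > r$ and $BP = e^{1 - r/c}$ otherwise, where $c$ is the candidate length and $r$ is the reference length.
  - $w_n$ is the weight given to $n$-grams of order $n$ (typically uniform, $w_n = \frac{1}{N}$ for all $n$).
  - $p_n$ is the clipped precision of $n$-grams, defined below.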

- **Range**: $[0, 1]$, often multiplied by 100 to give a percentage.
- **Use Cases**: Machine translation quality estimation, also used for text summarization and caption generation.
- **Pros**:
  - Language-independent and easy to understand.
  - Correlates well with human judgment at the corpus level.
- **Cons**:
  - Does not account for meaning or grammatical correctness.
  - Heavily reliant on reference translations, which may not encapsulate all valid translations.
  
As an aside, precision for $n$-grams is defined as the ratio of the number of matching $n$-grams (after clipping) to the total number of $n$-grams in the generated sequence.


  $$\text{Precision}_{n} = \frac{\sum_{\text{n-gram} \in \text{Hypothesis}} \min\left(\text{Count}(\text{n-gram}), \text{Count}_{\text{Reference}}(\text{n-gram})\right)}{\sum_{\text{n-gram} \in \text{Hypothesis}} \text{Count}(\text{n-gram})}$$

  where:
  - $\text{Count}(\text{n-gram})$ is the number of occurrences of the n-gram in the generated sequence (hypothesis).
  - $\text{Count}_{\text{Reference}}(\text{n-gram})$ is the largest number of times the n-gram occurs in any single reference sequence. Taking the minimum clips the hypothesis count to this value.

N-gram precision is used to assess the overlap between a candidate translation and one or more reference translations, focusing on exact word matches. The clipping is important to prevent a system from getting undue credit for repeated phrases.

- **Note**: In practice, for BLEU score calculations, precision is calculated for multiple n-gram lengths (e.g., 1-gram, 2-gram, 3-gram, and 4-gram) and combined using a weighted geometric mean, with brevity penalty incorporated to account for overly short translations.
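
The clipping rule and the brevity penalty can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions (whitespace tokenization, hypothetical helper names `ngrams`, `clipped_precision`, and `brevity_penalty`), not the implementation used by the `evaluate` library.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(hypothesis, references, n):
    """Clipped n-gram precision of one hypothesis against one or more references."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    # For each n-gram, the largest count observed in any single reference
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0

def brevity_penalty(hyp_len, ref_len):
    """1 if the hypothesis is longer than the reference, exp(1 - r/c) otherwise."""
    if hyp_len == 0:
        return 0.0
    return 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)

# A degenerate hypothesis that repeats a common word
hypothesis = "the the the the the the the".split()
references = ["the cat is on the mat".split(), "there is a cat on the mat".split()]

p1 = clipped_precision(hypothesis, references, 1)
print(p1)                      # 2 of the 7 unigrams are credited after clipping
print(brevity_penalty(5, 10))  # a hypothesis half as long as the reference
```

Without clipping, every "the" in the hypothesis would count as a match and the unigram precision would be 7/7. With clipping, only two occurrences are credited because no single reference contains "the" more than twice.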

#### ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

- **Mathematical Definition**: For ROUGE-N:

  $$ROUGE\text{-}N = \frac{\sum_{S \in \{References\}} \sum_{gram_n \in S} Count_{match}(gram_n)}{\sum_{S \in \{References\}} \sum_{gram_n \in S} Count(gram_n)}$$

  where:
  - $Count_{match}(gram_n)$ is the maximum number of times the $n$-gram co-occurs in the system output and the reference.
  - $Count(gram_n)$ counts the occurrences in the reference summaries.

- **Range**: $[0, 1]$.
- **Use Cases**: Mainly used for evaluating text summarization and can also be applied to machine translation.
- **Pros**:
  - Recall-oriented: it rewards covering the content of the reference. Common implementations also report precision and F1, which gives a more balanced view of performance.
- **Cons**:
  - Like BLEU, it is also limited by the quality and variety of reference summaries.
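
The ROUGE-N formula above can be sketched as follows, assuming whitespace tokenization and a hypothetical helper name `rouge_n_recall`. The `rouge` metric in `evaluate` additionally applies its own tokenization and reports F-measures, so its numbers will not match this sketch exactly.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate, references, n):
    """Fraction of reference n-grams that also appear in the candidate."""
    cand_counts = Counter(ngrams(candidate, n))
    matched = 0
    total = 0
    for ref in references:
        ref_counts = Counter(ngrams(ref, n))
        # An n-gram can be matched at most as often as it occurs in the candidate
        matched += sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0

candidate = "the cat sat on the mat".split()
references = ["the cat was on the mat".split()]

r1 = rouge_n_recall(candidate, references, 1)  # 5 of 6 reference unigrams are covered
r2 = rouge_n_recall(candidate, references, 2)  # 3 of 5 reference bigrams are covered
print(r1, r2)
```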

#### METEOR (Metric for Evaluation of Translation with Explicit Ordering)

- **Mathematical Definition**:  

  $$METEOR = \frac{10P \cdot R}{R + 9P} \cdot \left(1 - 0.5 \cdot \left(\frac{C}{U_m}\right)^3\right)$$

  where:
  - $P$ is the precision of unigram matches.
  - $R$ is the recall of unigram matches.
  - $C$ is the number of chunks (contiguous unigram matches) in the alignment.
  - $U_m$ is the number of matched unigrams. Fewer, longer chunks mean the matched words appear in a similar order, so the penalty is smaller.

- **Range**: $[0, 1]$.
- **Use Cases**: It is used to evaluate the quality of machine translation outputs.
- **Pros**:
  - Accounts for word order and synonymy, achieving a better correlation with human judgments.
- **Cons**:
  - Complex computation with several stages.
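
A minimal sketch of the final scoring step, assuming the penalty from the original METEOR formulation, $0.5 \cdot (C / U_m)^3$, where $U_m$ is the number of matched unigrams. The alignment stage that produces the match and chunk counts (exact, stem, and synonym matching) is not implemented here; the counts are passed in directly, and the function name `meteor_score` is hypothetical.

```python
def meteor_score(matches, hyp_len, ref_len, chunks):
    """METEOR from alignment statistics: matched unigrams, lengths, and chunk count."""
    if matches == 0:
        return 0.0
    precision = matches / hyp_len
    recall = matches / ref_len
    # Harmonic mean that weights recall nine times more than precision
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # Fragmentation penalty: many short chunks indicate scrambled word order
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)

# All 6 words match in a single contiguous chunk
in_order = meteor_score(matches=6, hyp_len=6, ref_len=6, chunks=1)
# The same 6 words match, but each one forms its own chunk (fully scrambled order)
scrambled = meteor_score(matches=6, hyp_len=6, ref_len=6, chunks=6)
print(in_order, scrambled)
```

With identical unigram precision and recall, the scrambled output loses half of its score to the penalty, which is how the metric accounts for word order.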

#### BERTScore

BERTScore is an evaluation metric that compares a candidate text with a reference by the cosine similarity of their tokens' contextual embeddings, taken from a pre-trained model such as BERT.

- **Mathematical Definitions**:
  - **Precision**: Measures coverage of the candidate's tokens in the reference.
    $$ P = \frac{1}{|C|} \sum_{i=1}^{|C|} \max_{j=1}^{|R|} \text{cos}(c_i, r_j) $$
  - **Recall**: Measures coverage of the reference's tokens in the candidate.
    $$ R = \frac{1}{|R|} \sum_{i=1}^{|R|} \max_{j=1}^{|C|} \text{cos}(r_i, c_j) $$
  - **F1 Score**: The harmonic mean of precision and recall.
    $$ F1 = \frac{2 \cdot P \cdot R}{P + R} $$

  where:
  - $|C|$ is the number of tokens in the candidate (generated) text.
  - $|R|$ is the number of tokens in the reference text.
  - $c_i$ is the embedding of the $i$-th token in the candidate text.
  - $r_j$ is the embedding of the $j$-th token in the reference text.
  - $\text{cos}$ denotes the cosine similarity function.

- **Range**:
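  - **Precision** and **Recall**: Cosine similarity lies in $[-1, 1]$, so the scores can in principle be negative. In practice they typically fall in $[0, 1]$, with 1 indicating perfect precision or recall.
  - **F1 Score**: Also typically $[0, 1]$, with 1 being the best F1 score.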

- **Use Cases**: BERTScore is used for evaluating the quality of generated text in tasks such as translation, summarization, text generation, and more. It provides individual measurements for precision, recall, and F1, offering a multifaceted view of a model’s performance.

- **Pros**:
  - Provides a more nuanced evaluation by computing separate scores for precision, recall, and F1.
  - Captures semantic similarity better than overlap-based metrics, like BLEU.
  - Robust to paraphrasing and more sensitive to the meaning of the text.

- **Cons**:
  - Computationally intensive, as it uses contextual embeddings from large transformer models.
  - The high-resource requirement to run evaluations, especially with large datasets.
  - The need for careful selection of baseline or reference models to ensure fair comparison.
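
The greedy matching in the precision and recall formulas can be sketched with toy vectors. The two-dimensional "embeddings" below are made up for illustration; a real BERTScore computation would take them from a transformer layer, and the helper names `cosine` and `greedy_bertscore` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_bertscore(cand_emb, ref_emb):
    """Precision, recall and F1 from greedy best-match cosine similarities."""
    # Precision: each candidate token is matched to its most similar reference token
    precision = sum(max(cosine(c, r) for r in ref_emb) for c in cand_emb) / len(cand_emb)
    # Recall: each reference token is matched to its most similar candidate token
    recall = sum(max(cosine(r, c) for c in cand_emb) for r in ref_emb) / len(ref_emb)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy embeddings: the candidate has 2 tokens, the reference has 3
cand_emb = [[1.0, 0.0], [0.0, 1.0]]
ref_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

p, r, f1 = greedy_bertscore(cand_emb, ref_emb)
print(p, r, f1)
```

Here every candidate token has an exact counterpart in the reference, so precision is 1, while the third reference token is only partially covered, which lowers recall.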

## Load a small evaluation dataset and subsample to 10 reference texts

In [2]:
dataset = load_dataset('cnn_dailymail', '3.0.0', split='validation')
dataset = dataset.shuffle(seed=42).select(range(10)) # Subsample to 10 examples

# Preview the structure of the dataset
print(dataset.column_names) 

['article', 'highlights', 'id']


## Load 2 small LLMs 

In [3]:
model_names = ["sshleifer/distilbart-cnn-6-6", "sshleifer/distilbart-cnn-12-6"]
models = []
tokenizers = []

# Load models and tokenizers
for model_name in model_names:
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True) # Ensure using fast tokenizers
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to('cuda')
    models.append(model)
    tokenizers.append(tokenizer)

## Define function for evaluation 

In [20]:
def evaluate_models(models, model_names, tokenizers, dataset, metrics):
    supported_metrics = {'bleu', 'rouge', 'bertscore'}
    for metric_name in metrics:
        if metric_name not in supported_metrics:
            raise ValueError(f"Metric not supported: {metric_name}")

    # Load each metric once; the references are the same for every model
    loaded_metrics = {metric_name: evaluate.load(metric_name) for metric_name in metrics}
    references = [example['highlights'] for example in dataset]

    results = {}
    for model, tokenizer, name in zip(models, tokenizers, model_names):
        model.to('cuda')  # Make sure the model is on the GPU

        # Generate the summaries once per model and reuse them for every metric
        generated_texts = []
        for example in dataset:
            # Tokenize the article (truncated to 1024 tokens) and move the tensors to the GPU
            inputs = tokenizer(example['article'], return_tensors='pt', max_length=1024, truncation=True).to('cuda')
            summary_ids = model.generate(
                inputs["input_ids"], attention_mask=inputs["attention_mask"],
                max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True,
            )
            generated_texts.append(tokenizer.decode(summary_ids[0], skip_special_tokens=True))

        model_results = {}
        for metric_name, metric in loaded_metrics.items():
            if metric_name in ('bleu', 'rouge'):
                # `bleu` and `rouge` accept a list of reference strings per prediction
                metric_result = metric.compute(predictions=generated_texts, references=[[r] for r in references])
            else:
                # `bertscore` takes one reference string per prediction here
                metric_result = metric.compute(predictions=generated_texts, references=references, lang='en', device='cuda')
            model_results[metric_name] = metric_result
        results[name] = model_results

    return results

### Generate evaluation table 

In [21]:
metrics = ['rouge', 'bertscore', 'bleu'] # BLEU is less common for summarization tasks, so you might skip it
evaluation_results = evaluate_models(models, model_names, tokenizers, dataset, metrics)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [22]:
# First, iterate through evaluation results to structure the dictionary as needed beforehand
structured_results = {}
for model_name in evaluation_results:
    for metric in metrics:  # 'rouge', 'bertscore', 'bleu'
        for key, value in evaluation_results[model_name][metric].items():
            # Add an entry for each sub-metric for each model
            structured_results[(metric, key)] = structured_results.get((metric, key), []) + [value]

# Now we create a DataFrame with the structured results
evaluation_table = pd.DataFrame(structured_results, index=model_names)

# Reorder the columns so that sub-metrics are sorted alphabetically within each metric,
# e.g. 'rouge1', 'rouge2', 'rougeL', 'rougeLsum' under 'rouge'
# and 'f1', 'precision', 'recall' under 'bertscore'.
evaluation_table = evaluation_table.reindex(columns=pd.MultiIndex.from_tuples(
    [(metric, sub_metric) for metric in metrics for sub_metric in sorted(evaluation_table.columns.get_level_values(1))
     if (metric, sub_metric) in evaluation_table.columns]
), fill_value=0)

# Show the evaluation table with metrics as top-level headers and sub-metrics as secondary headers
display(evaluation_table)


| model | rouge / rouge1 | rouge / rouge2 | rouge / rougeL | rouge / rougeLsum | bertscore / f1 | bertscore / hashcode | bertscore / precision | bertscore / recall | bleu / bleu | bleu / brevity_penalty | bleu / length_ratio | bleu / precisions | bleu / reference_length | bleu / translation_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sshleifer/distilbart-cnn-6-6 | 0.38405 | 0.151075 | 0.248238 | 0.317526 | [0.9208353161811829, 0.8709611296653748, 0.850... | roberta-large_L17_no-idf_version=0.3.12(hug_tr... | [0.9184439778327942, 0.8651304244995117, 0.853... | [0.9232390522956848, 0.8768709897994995, 0.848... | 0.126661 | 0.915957 | 0.919298 | [0.41412213740458015, 0.1575875486381323, 0.08... | 570 | 524 |
| sshleifer/distilbart-cnn-12-6 | 0.446609 | 0.2198 | 0.32285 | 0.375602 | [0.9448792934417725, 0.916256844997406, 0.8688... | roberta-large_L17_no-idf_version=0.3.12(hug_tr... | [0.9340245723724365, 0.9070568680763245, 0.864... | [0.9559891819953918, 0.9256454110145569, 0.873... | 0.181944 | 1.0 | 1.114035 | [0.4393700787401575, 0.2, 0.13008130081300814,... | 570 | 635 |
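
The BERTScore columns in the table hold one value per example, which makes the two models hard to compare at a glance. One possible way to condense them, sketched here with a hypothetical helper `summarize_bertscore` and made-up toy values, is to average each list before building the table.

```python
from statistics import mean

def summarize_bertscore(bertscore_result):
    """Average the per-example precision, recall and F1 lists into single numbers."""
    # The 'hashcode' entry is a configuration string, so it is left out of the summary
    return {key: mean(bertscore_result[key]) for key in ('precision', 'recall', 'f1')}

# Toy result with the same keys as the `bertscore` output (values are illustrative only)
toy_result = {
    'precision': [0.90, 0.80],
    'recall': [0.70, 0.90],
    'f1': [0.79, 0.85],
    'hashcode': 'toy-model',
}

summary = summarize_bertscore(toy_result)
print(summary)
```

In the notebook, this could be applied to `evaluation_results[model_name]['bertscore']` for each model before the results are structured into the DataFrame.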

