### Evaluation metrics

**Categories of metrics**

There are three high-level categories of metrics:

- Generic metrics, which can be applied to a variety of situations and datasets, such as precision and accuracy.
- Task-specific metrics, which are limited to a given task, such as Machine Translation (often evaluated with BLEU or ROUGE) or Named Entity Recognition (often evaluated with seqeval).
- Dataset-specific metrics, which aim to measure model performance on specific benchmarks: for instance, the GLUE benchmark has a dedicated evaluation metric.

**Generic metrics**

Generic metrics include **accuracy** and **precision**, which can be used to evaluate any labeled (supervised) dataset, as well as **perplexity**, which can be used to evaluate different kinds of (unsupervised) generative tasks.


In [1]:
!pip install -q evaluate


In [2]:
import evaluate
precision_metric = evaluate.load("precision")
results = precision_metric.compute(references=[0, 1], predictions=[0, 1])
print(results)

{'precision': 1.0}
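To make the two generic metrics named above concrete, here is a minimal pure-Python sketch of accuracy and binary precision. The helper names `accuracy` and `precision`, the toy labels, and the choice to return `0.0` when nothing is predicted positive are assumptions made for illustration; this is not how the `evaluate` library implements them.

```python
def accuracy(references, predictions):
    """Fraction of predictions that match their reference label."""
    correct = sum(r == p for r, p in zip(references, predictions))
    return correct / len(references)


def precision(references, predictions, positive=1):
    """True positives divided by all predicted positives."""
    true_pos = sum(r == positive and p == positive
                   for r, p in zip(references, predictions))
    pred_pos = sum(p == positive for p in predictions)
    # Assumption: report 0.0 when no example was predicted positive.
    return true_pos / pred_pos if pred_pos else 0.0


refs = [0, 1, 1, 0, 1]
preds = [0, 1, 0, 1, 1]
print(accuracy(refs, preds))   # 3 of the 5 labels match
print(precision(refs, preds))  # 2 true positives out of 3 predicted positives
```

On the two-example input used in the cell above (`references=[0, 1]`, `predictions=[0, 1]`), this sketch also gives a precision of 1.0.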


**Task-specific metrics**

- Popular ML tasks like Machine Translation and Named Entity Recognition have specific metrics that can be used to compare models. For example, a range of metrics has been proposed for text generation, from BLEU and its derivatives such as GoogleBLEU and GLEU to ROUGE, MAUVE, and others.
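As a rough illustration of what a task-specific metric measures, the sketch below computes clipped unigram precision, which is one ingredient of BLEU. It is deliberately simplified: whitespace tokenization, a single reference, no higher-order n-grams, and no brevity penalty, so it should not be read as a BLEU implementation. The function name `clipped_unigram_precision` is hypothetical.

```python
from collections import Counter


def clipped_unigram_precision(candidate, reference):
    """Share of candidate words found in the reference, with each word's
    count capped at the number of times it appears in the reference."""
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    clipped = sum(min(count, ref_counts[word])
                  for word, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0


score = clipped_unigram_precision("the the the the the the the",
                                  "the cat is on the mat")
print(score)  # 2 clipped matches out of 7 candidate words
```

Clipping is what stops a degenerate candidate that repeats a common word from scoring well: without it, all seven words would count as matches.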


**Dataset-specific metrics**

- Some datasets have specific metrics associated with them; this is especially the case for popular benchmarks like GLUE and SQuAD.


In [3]:
import evaluate
evaluate.load('perplexity')

EvaluationModule(name: "perplexity", module_type: "metric", features: {'predictions': Value('string')}, usage: """
Args:
    model_id (str): model used for calculating Perplexity
            NOTE: Perplexity can only be calculated for causal language models.
                    This includes models such as gpt2, causal variations of bert,
                    causal versions of t5, and more (the full list can be found
                    in the AutoModelForCausalLM documentation here:
                    https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoModelForCausalLM )

    predictions (list of str): input text, each separate text snippet
        is one list entry.
    batch_size (int): the batch size to run texts through the model. Defaults to 16.
    add_start_token (bool): whether to add the start token to the texts,
        so the perplexity can include the probability of the first word. Defaults to True.
    device (str): device to run on, defaul

**Perplexity**

Perplexity (PPL) is one of the most common metrics for evaluating language models. Before diving in, we should note that the metric applies specifically to classical language models (sometimes called autoregressive or causal language models) and is not well defined for masked language models like BERT.

Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. If we have a tokenized
sequence $X = (x_0, x_1, \dots, x_t)$, then the perplexity of $X$ is,

$$ \text{PPL}(X) = \exp\left\{ -\frac{1}{t}\sum_i^t \log p_\theta (x_i|x_{<i})  \right\} $$

where $\log p_\theta (x_i|x_{<i})$ is the log-likelihood of the ith token conditioned on the preceding tokens $x_{<i}$ according to our model.

- Intuitively, it can be thought of as an evaluation of the model's ability to predict uniformly among the set of specified tokens in a corpus.
- Importantly, this means that the tokenization procedure has a direct impact on a model's perplexity which should always be taken into consideration when comparing different models.

Perplexity is also equivalent to the exponentiation of the cross-entropy between the data and the model's predictions.
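The definition can be checked on toy numbers. The sketch below takes a list of per-token probabilities (made up here, not produced by any model) and applies the formula directly. The helper name `perplexity` is an assumption for illustration. A model that spreads its probability uniformly over 50 choices at every step ends up with a perplexity of about 50, which matches the "predict uniformly" intuition.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-probability of the observed tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# A model that assigns probability 1/50 to every observed token
uniform = perplexity([1 / 50] * 10)
# A model that is fairly confident in every observed token
confident = perplexity([0.9, 0.8, 0.95, 0.7])
print(uniform, confident)
```

The more probability the model places on the tokens that actually occur, the closer the perplexity gets to its lower bound of 1.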



## Calculating PPL with fixed-length models

- If we weren't limited by a model's context size, we would evaluate the model's perplexity by autoregressively
factorizing a sequence and conditioning on the entire preceding subsequence at each step, as shown below.

<img width="600" alt="Full decomposition of a sequence with unlimited context length" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/ppl_full.gif"/>

- When working with approximate models, however, we typically have a constraint on the number of tokens the model can
process. The largest version of [GPT-2](https://huggingface.co/docs/transformers/main/en/model_doc/gpt2), for example, has a fixed length of 1024 tokens, so we
cannot calculate $p_\theta(x_t|x_{<t})$ directly when $t$ is greater than 1024.

- Instead, the sequence is typically broken into sub-sequences with equal length, where the length is equal to the model's maximum input size.

- If a model's max input size is $k$, we then approximate the likelihood of a token $x_t$ by conditioning only on the $k-1$ tokens that precede it rather than the entire context.

- When evaluating the model's perplexity of a
sequence, a tempting but suboptimal approach is to break the sequence into disjoint chunks and add up the decomposed log-likelihoods of each segment independently.

<img width="600" alt="Suboptimal PPL not taking advantage of full available context" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/ppl_chunked.gif"/>

- This is quick to compute since the perplexity of each segment can be computed in one forward pass, but serves as a poor approximation of the fully-factorized perplexity and will typically yield a higher (worse) PPL because the model will
have less context at most of the prediction steps.

- Instead, the PPL of fixed-length models should be evaluated with a sliding-window strategy. This involves repeatedly
sliding the context window so that the model has more context when making each prediction.

<img width="600" alt="Sliding window PPL taking advantage of all available context" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/ppl_sliding.gif"/>

- This is a closer approximation to the true decomposition of the sequence probability and will typically yield a more
favorable score. The downside is that it requires a separate forward pass for each token in the corpus.
- A good practical compromise is to employ a strided sliding window, moving the context by larger strides rather than sliding by 1 token at a time. This allows computation to proceed much faster while still giving the model a large context to make predictions at each step.
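The windowing itself can be sketched without any model. The hypothetical helper `strided_windows` below lists, for toy sizes, which token range each forward pass would cover and how many tokens at the end of that range are newly scored (the rest serve only as context). The sizes are chosen for readability and are not GPT-2's.

```python
def strided_windows(seq_len, max_length, stride):
    """Yield (begin, end, n_scored) for each forward pass.

    n_scored is how many tokens at the end of the window are newly scored;
    the tokens before them are context only.
    """
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        yield begin, end, end - prev_end
        prev_end = end
        if end == seq_len:
            break


windows = list(strided_windows(seq_len=10, max_length=4, stride=2))
print(windows)

# With stride == max_length the windows are disjoint chunks
disjoint = list(strided_windows(seq_len=10, max_length=4, stride=4))
print(disjoint)
```

A smaller stride means more forward passes, but every scored token after the first window sees more preceding context; every token is still scored exactly once.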


In [4]:
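import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Use the GPU when one is available; fall back to CPU otherwise
device = "cuda" if torch.cuda.is_available() else "cpu"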
model_id = "openai-community/gpt2"
model = GPT2LMHeadModel.from_pretrained(model_id).to(device)
tokenizer = GPT2TokenizerFast.from_pretrained(model_id)


- We'll load the test split of the WikiText-2 dataset and evaluate the perplexity using a few different sliding-window strategies.
- Since this dataset is small and we're doing just one forward pass over it, we can load and encode the entire dataset in memory.

In [5]:
from datasets import load_dataset

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

Token indices sequence length is longer than the specified maximum sequence length for this model (287644 > 1024). Running this sequence through the model will result in indexing errors


In [6]:
# The whole test dataset, tokenized as a single sequence
encodings

{'input_ids': tensor([[ 628,  796, 5199,  ...,  220,  628,  198]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1]])}

With 🤗 Transformers, we can simply pass the `input_ids` as the `labels` to our model, and the average negative log-likelihood for each token is returned as the loss.

- With our sliding-window approach, however, there is overlap in the tokens we pass to the model at each iteration. We don't want the log-likelihood of the tokens we're treating only as context to be included in our loss, so we set these targets to `-100` so that they are ignored.
- The following is an example of how we could do this with a stride of `512`. This means that the model will have at least 512 tokens of context when calculating the conditional likelihood of any one token (provided there are 512 preceding tokens available to condition on).
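The effect of the `-100` targets can be illustrated on plain Python lists. In this sketch the per-token losses and token ids are made up, `mean_nll` is a hypothetical helper, and the one-position label shift that the model applies internally is ignored for simplicity.

```python
IGNORE_INDEX = -100


def mean_nll(token_nlls, labels):
    """Average the per-token losses whose label is not the ignore index."""
    kept = [nll for nll, label in zip(token_nlls, labels)
            if label != IGNORE_INDEX]
    return sum(kept) / len(kept)


token_nlls = [2.0, 1.0, 3.0, 0.5, 1.5]  # made-up per-token losses
labels = [11, 12, 13, 14, 15]           # made-up token ids
trg_len = 2                             # only the last two tokens are new

# Mask everything except the last trg_len positions
masked = list(labels)
masked[:-trg_len] = [IGNORE_INDEX] * (len(labels) - trg_len)
print(masked)
print(mean_nll(token_nlls, masked))
```

Only the last two positions contribute to the average, so the context tokens influence the prediction without being counted in the loss a second time.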

In [13]:
import torch
from tqdm import tqdm

max_length = model.config.n_positions
stride = 512
seq_len = encodings.input_ids.size(1)

In [14]:
nll_sum = 0.0
n_tokens = 0
prev_end_loc = 0

for begin_loc in tqdm(range(0, seq_len, stride)):
  end_loc = min(begin_loc + max_length, seq_len)
  trg_len = end_loc - prev_end_loc  # may be different from stride on last loop
  input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
  target_ids = input_ids.clone()
  target_ids[:, :-trg_len] = -100

  with torch.no_grad():
    outputs = model(input_ids, labels=target_ids)

  # loss is calculated using CrossEntropyLoss which averages over valid labels
  # N.B. the model only calculates loss over trg_len - 1 labels, because it internally shifts the labels to the left by 1
  neg_log_likelihood = outputs.loss

  # Accumulate the total negative log-likelihood and the total number of tokens
  num_valid_tokens = (target_ids != -100).sum().item()  # number of valid tokens in target_ids
  batch_size = target_ids.size(0)
  num_loss_tokens = num_valid_tokens - batch_size  # subtract batch_size due to internal label shift
  nll_sum += neg_log_likelihood * num_loss_tokens
  n_tokens += num_loss_tokens

  prev_end_loc = end_loc
  if end_loc == seq_len:
    break

avg_nll = nll_sum / n_tokens  # average negative log-likelihood per token
ppl = torch.exp(avg_nll)


In [15]:
ppl

tensor(25.1704, device='cuda:0')

Running this with the stride length equal to the max input length is equivalent to the suboptimal, non-sliding-window
strategy we discussed above. The smaller the stride, the more context the model will have in making each prediction,
and the better the reported perplexity will typically be.

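When the above is run with `stride = 1024`, i.e. no overlap, the resulting PPL is `19.44`, which is about the same
as the `19.93` reported in the GPT-2 paper. With `stride = 512`, and thereby the strided sliding-window
strategy, this drops to `16.44`. This is not only a more favorable score, but is calculated in a way that is
closer to the true autoregressive decomposition of a sequence likelihood. Note that these figures do not match the `25.17` obtained above with `stride = 512`; they are consistent with a larger GPT-2 checkpoint (such as `gpt2-large`) rather than the base `openai-community/gpt2` model loaded in this notebook.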

**Using the `evaluate` package**

In [18]:
perplexity = evaluate.load("perplexity", module_type="metric")
input_texts = ["lorem ipsum", "Happy Birthday!", "Bienvenue"]

results = perplexity.compute(model_id='gpt2',
                             add_start_token=False,
                             predictions=input_texts)
print(results)
print(list(results.keys()))


{'perplexities': [32.254302978515625, 1499.69482421875, 408.27459716796875], 'mean_perplexity': np.float64(646.7412414550781)}
['perplexities', 'mean_perplexity']


**Using text from a dataset**

In [19]:
from datasets import load_dataset
perplexity = evaluate.load("perplexity", module_type="metric")
input_texts = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"][:10]
input_texts = [s for s in input_texts if s != ""]  # drop empty lines
results = perplexity.compute(model_id='gpt2',
                            predictions=input_texts)

In [20]:
print(list(results.keys()))
print(round(results["mean_perplexity"], 2))
print(round(results["perplexities"][0], 2))

['perplexities', 'mean_perplexity']
244.06
567.91
