## Metrics

Many evaluation metrics exist, and `lighteval` implements the ones most commonly used in NLP.
It helps to understand how each metric is computed by reading the source code: [https://github.com/huggingface/lighteval/tree/v0.11.0/src/lighteval/metrics](https://github.com/huggingface/lighteval/tree/v0.11.0/src/lighteval/metrics).


The metrics reference page, [https://huggingface.co/docs/lighteval/en/package_reference/metrics#sample-metrics](https://huggingface.co/docs/lighteval/en/package_reference/metrics#sample-metrics), documents the high-level API of the different metrics.


#### Normalizations
Many NLP papers propose their own metrics and evaluations, and they often rely on different preprocessing steps.
The library gathers these preprocessing functions in `normalizations.py`, inside the source directory linked above. Examples include math-oriented normalizers such as `gsm8k_normalizer` and `math_normalizer`, and language-oriented normalizers such as `helm_normalizer`. Reading them teaches a lot about the preprocessing each benchmark expects.
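
To make the idea of a normalizer concrete, here is a minimal sketch of a SQuAD-style text normalizer. The function name `simple_normalize` and its exact steps are assumptions chosen for illustration; this is not the code of any function in `normalizations.py`.

```python
import re
import string


def simple_normalize(text: str) -> str:
    """Hypothetical normalizer: lowercase, drop punctuation and articles, collapse spaces."""
    text = text.lower()
    # Remove ASCII punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Replace English articles with a space so neighbouring words stay separated.
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


print(simple_normalize("The  Eiffel Tower!"))  # eiffel tower
```

Two systems that apply different normalizers can report different scores for the same predictions, which is why it matters which normalizer a benchmark uses.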


### Directly running the evaluation using lighteval

You can evaluate a language model directly from the `lighteval` command line by providing the model and the task name:


```bash
lighteval accelerate \
    "model_name=gpt2,batch_size=8" \
    "leaderboard|hellaswag|0" \
    --output-dir ./results
```

You can also use a faster inference backend such as vLLM, optionally spread across multiple GPUs:
```bash
export VLLM_WORKER_MULTIPROC_METHOD=spawn
lighteval vllm \
  "model_name=meta-llama/Llama-3.1-8B-Instruct,tensor_parallel_size=2" \
  "lighteval|gsm8k|5"
```

or evaluate a model served behind an API:

```bash
lighteval endpoint litellm \
  "provider=openai,model_name=gpt-4o-mini" \
  "lighteval|mmlu|0"
```

This is convenient, but for now we focus on what happens underneath.


In [None]:
import lighteval



# Definitions

* **Gold $g$**: the reference text provided by the dataset. For example, in QA, $g$ might be “Paris” if the question is “What is the capital of France?”. In summarization, $g$ is the human-written summary.

* **Prediction $p$**: the model’s output. Depending on the setting this could be:

  * a text string sampled from the model’s logits, e.g. “Paris, France”.
  * a log-probability vector over choices, e.g. $(\ell_1,\dots,\ell_m)$ if there are $m$ candidate answers.
  * multiple samples $p^{(1)},p^{(2)},\dots,p^{(n)}$ from the same prompt, for multi-sample metrics.

* **Choice set $\{c_1,\dots,c_m\}$**: in multiple-choice or classification tasks, the discrete set of possible answers.

---

# String-based metrics

These compare prediction text $p$ to reference $g$ directly. They do not need access to the evaluated model’s logits. They may, however, use an **external pretrained LM** (e.g. BLEURT, BERTScore, SummaCZS).

---

### Exact Match

Checks whether the prediction exactly matches the gold, or alternatively starts or ends with it (prefix/suffix modes).

* **Input:** prediction string $p$, reference string(s) $g$.
* **Formulation:**

  $$
  \text{EM}(p,g) = 
  \begin{cases}
  1 & \nu(p)=\nu(g) \quad \text{(full)}\\
  1 & \nu(p)\text{ startswith }\nu(g) \quad \text{(prefix)}\\
  1 & \nu(p)\text{ endswith }\nu(g) \quad \text{(suffix)}\\
  0 & \text{otherwise}
  \end{cases}
  $$

  where $\nu(\cdot)$ is optional normalization (strip, lowercase, etc.).
* **When to use:** Useful in tasks with unambiguous answers (math word problems, factual QA).
* **When not:** Not good for free-form answers with paraphrases.
* **LLM use:** **No external LM**; purely heuristic.
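
The three matching modes in the formulation above can be sketched in a few lines. The helper names `normalize` and `exact_match` and the choice of normalization (strip plus lowercase) are assumptions for illustration, not the library's implementation.

```python
def normalize(text: str) -> str:
    """Hypothetical normalization nu(.): trim whitespace and lowercase."""
    return text.strip().lower()


def exact_match(pred: str, gold: str, mode: str = "full") -> int:
    """Return 1 if the normalized prediction matches the gold under `mode`, else 0."""
    p, g = normalize(pred), normalize(gold)
    if mode == "full":
        return int(p == g)
    if mode == "prefix":
        return int(p.startswith(g))
    if mode == "suffix":
        return int(p.endswith(g))
    raise ValueError(f"unknown mode: {mode}")


print(exact_match("Paris, France", "Paris"))            # 0: the full strings differ
print(exact_match("Paris, France", "paris", "prefix"))  # 1: prediction starts with gold
```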

---

### F1 over Bag-of-Words

Measures token-level overlap between gold and prediction.

* **Input:** sets of tokens $G=\text{tok}(g)$, $P=\text{tok}(p)$.
* **Formulation:**

  $$
  \text{Prec} = \frac{|P \cap G|}{|P|},\quad
  \text{Rec} = \frac{|P \cap G|}{|G|},\quad
  F1 = \frac{2\cdot\text{Prec}\cdot\text{Rec}}{\text{Prec}+\text{Rec}}
  $$
* **When to use:** QA benchmarks like SQuAD, where partial overlap counts.
* **When not:** Sensitive to synonyms; not good for paraphrase-heavy tasks.
* **LLM use:** **No external LM**.
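
As a sanity check on the formula, here is a minimal sketch of set-based token F1. It assumes whitespace tokenization and no normalization; the function name `token_f1` is hypothetical.

```python
def token_f1(pred: str, gold: str) -> float:
    """F1 between the sets of whitespace-separated tokens of pred and gold."""
    P, G = set(pred.split()), set(gold.split())
    overlap = len(P & G)
    if overlap == 0:
        return 0.0
    precision = overlap / len(P)
    recall = overlap / len(G)
    return 2 * precision * recall / (precision + recall)


gold = "A quick brown fox jumps over the lazy dog."
pred = "A fast brown fox leaps over a lazy dog."
# Both sides have 9 distinct tokens and share 6 of them, so Prec = Rec = F1 = 6/9.
print(round(token_f1(pred, gold), 4))  # 0.6667
```

Note that without normalization `A` and `a` are different tokens and `dog.` keeps its period, so the choice of normalizer changes the score.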

---

### ROUGE (1/2/L/Lsum)

Overlap-based metrics, widely used for summarization.

* **Input:** n-grams or longest common subsequence between $p$ and $g$.
* **Formulation:**
  For ROUGE-n:

  $$
  \text{ROUGE-n} = \frac{\sum_{k\in \text{n-grams}} \min(\text{count}_p(k),\ \text{count}_g(k))}{\sum_{k\in \text{n-grams}} \text{count}_g(k)}
  $$

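  The formula above is the recall variant; dividing by the n-gram count of $p$ instead gives precision. Implementations typically report recall, precision, and F1.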
* **When to use:** Summarization tasks.
* **When not:** Not sensitive to meaning; can reward copying.
* **LLM use:** **No external LM**.
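
The clipped n-gram overlap in the ROUGE-n formula can be sketched with `collections.Counter`. This is a simplified illustration that assumes lowercased whitespace tokens and a single reference; the names `ngram_counts` and `rouge_n` are hypothetical, and real implementations add stemming and other details.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(pred: str, gold: str, n: int = 1) -> dict:
    p_counts = ngram_counts(pred.lower().split(), n)
    g_counts = ngram_counts(gold.lower().split(), n)
    # Counter intersection keeps min(count_p, count_g) for every n-gram.
    overlap = sum((p_counts & g_counts).values())
    recall = overlap / max(sum(g_counts.values()), 1)
    precision = overlap / max(sum(p_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# 3 of the 5 bigrams are shared, so precision and recall are both 3/5.
print(rouge_n("the cat sat on the mat", "the cat is on the mat", n=2))
```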

---

### BLEU

n-gram precision metric with brevity penalty.

* **Input:** n-grams from prediction and references.
* **Formulation:**

  $$
  \text{BLEU} = \text{BP}\cdot\exp\left(\sum_{n=1}^N w_n \log p_n\right)
  $$

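  where $p_n$ is the modified (clipped) n-gram precision, $w_n$ are the n-gram weights (usually $1/N$), and $\text{BP}$ is the brevity penalty: $\text{BP}=1$ if the candidate length $c$ exceeds the reference length $r$, and $e^{1-r/c}$ otherwise.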
* **When to use:** Machine translation.
* **When not:** Sentence-level BLEU is unstable; not good for short answers.
* **LLM use:** **No external LM**.
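
The following sketch computes an unsmoothed, single-reference BLEU with uniform weights, to show how the clipped precisions and the brevity penalty combine. The function names are hypothetical, and production implementations (for example in NLTK or sacreBLEU) add tokenization rules, smoothing, and multi-reference support.

```python
import math
from collections import Counter


def modified_precision(pred_tokens, ref_tokens, n):
    """Clipped n-gram precision p_n of the prediction against one reference."""
    pred = Counter(tuple(pred_tokens[i:i + n]) for i in range(len(pred_tokens) - n + 1))
    ref = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    total = sum(pred.values())
    clipped = sum((pred & ref).values())
    return clipped / total if total else 0.0


def bleu(pred: str, ref: str, max_n: int = 2) -> float:
    p_tok, r_tok = pred.split(), ref.split()
    precisions = [modified_precision(p_tok, r_tok, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU is 0 in this case
    c, r = len(p_tok), len(r_tok)
    bp = 1.0 if c > r else math.exp(1 - r / c)  # brevity penalty
    # Uniform weights w_n = 1 / max_n.
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


# p_1 = 5/6, p_2 = 3/5, BP = 1, so BLEU = sqrt(5/6 * 3/5) = sqrt(0.5).
print(round(bleu("the cat sat on the mat", "the cat is on the mat"), 4))  # 0.7071
```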

---

### BLEURT

Learned metric trained to align with human judgments.

* **Input:** gold $g$, pred $p$.
* **Formulation:**

  $$
  \text{BLEURT}(p,g) = f_\theta(p,g)
  $$

  where $f_\theta$ is a pretrained BERT-like regression model.
* **When to use:** Text generation tasks where semantic similarity matters.
* **When not:** May not generalize outside English or training domain.
* **LLM use:** **Yes (BLEURT pretrained LM)**.

---

### BERTScore

Semantic similarity using contextual embeddings.

* **Input:** token embeddings of $p$ and $g$.
* **Formulation:**

  $$
  \text{Prec} = \frac{1}{|P|}\sum_{t\in P}\max_{s\in G}\cos(e_t,e_s)
  $$

  $$
  \text{Rec} = \frac{1}{|G|}\sum_{s\in G}\max_{t\in P}\cos(e_s,e_t)
  $$

  $$
  \text{F1}=\frac{2\cdot\text{Prec}\cdot\text{Rec}}{\text{Prec}+\text{Rec}}
  $$
* **When to use:** Summarization, paraphrase tasks.
* **When not:** Requires heavy pretrained model; slower.
* **LLM use:** **Yes (DeBERTa scorer)**.
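
The greedy matching in the precision and recall formulas can be illustrated with toy embeddings. The 2-dimensional vectors below are made up; a real BERTScore computation takes contextual token embeddings from a pretrained encoder. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def greedy_match_scores(pred_emb, gold_emb):
    """Return (precision, recall, F1) using max-cosine matching in each direction."""
    prec = sum(max(cosine(t, s) for s in gold_emb) for t in pred_emb) / len(pred_emb)
    rec = sum(max(cosine(s, t) for t in pred_emb) for s in gold_emb) / len(gold_emb)
    f1 = 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)
    return prec, rec, f1


pred_emb = [[1.0, 0.0], [0.0, 1.0]]  # two predicted tokens (toy vectors)
gold_emb = [[1.0, 0.0]]              # one gold token (toy vector)
# Only one of the two predicted tokens has a match, so Prec = 0.5 and Rec = 1.0.
print(greedy_match_scores(pred_emb, gold_emb))
```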

---

### String Distance (Edit-based)

Uses token-level distances.

* **Input:** tokens of $p$, $g$.
* **Formulation:**

  * Edit distance: $\text{Lev}(p,g)$.
  * Edit similarity: $1-\frac{\text{Lev}(p,g)}{\max(|p|,|g|)}$.
  * Longest common prefix length.
* **When to use:** Low-level signal, e.g. code completion or data deduplication.
* **When not:** Not semantic; penalizes synonyms.
* **LLM use:** **No external LM**.
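
The three quantities above can be sketched with a standard dynamic-programming Levenshtein distance. The functions accept any sequences, so they work on token lists as well as on raw strings; the names are hypothetical.

```python
def levenshtein(a, b) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute, or match at cost 0
        prev = curr
    return prev[-1]


def edit_similarity(a, b) -> float:
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def common_prefix_length(a, b) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


pred_tokens = "return x + 1".split()
gold_tokens = "return x + 2".split()
print(levenshtein(pred_tokens, gold_tokens))           # 1 (one substituted token)
print(edit_similarity(pred_tokens, gold_tokens))       # 0.75
print(common_prefix_length(pred_tokens, gold_tokens))  # 3
```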

---

### Extractiveness

Measures how much a summary copies from the source.

* **Input:** source text $s$, summary $p$.
* **Formulation:**

  * Coverage = fraction of summary tokens in copied spans.
  * Density = average length of copied spans.
  * Compression = $|s|/|p|$.
* **When to use:** Summarization quality assessment.
* **When not:** Irrelevant outside summarization.
* **LLM use:** **No external LM**.

---

### Faithfulness (SummaCZS)

Zero-shot factual consistency.

* **Input:** source $s$, summary $p$.
* **Formulation:**

  $$
  \text{Faithfulness}(p,s) = f_{\text{SummaCZS}}(s,p)\in[0,1]
  $$
* **When to use:** Summarization or factual generation.
* **When not:** Expensive; depends on pretrained model.
* **LLM use:** **Yes (SummaCZS model)**.

---

# Log-probability based metrics

These require the evaluated LM’s logits or log-probs over choices.

---

### Loglikelihood Accuracy

Checks if top choice by log-prob is correct.

* **Input:** log-probs $\ell_i$ for each choice $c_i$, gold indices $\mathcal{G}$.
* **Formulation:**

  $$
  \text{LLAcc} = \mathbf{1}\!\left[\arg\max_i \ell_i \in \mathcal{G}\right]
  $$
* **When to use:** MCQ benchmarks, multiple-choice reasoning.
* **When not:** Free-form generation.
* **LLM use:** **Uses evaluated LM’s logits only**.

---

### Normalized Multi-Choice Probability

Gold probability normalized by total.

* **Input:** log-probs $\ell_i$, gold indices.
* **Formulation:**

  $$
  p_i = \exp(\ell_i),\quad
  \text{Score} = \max_{g\in\mathcal{G}} \frac{p_g}{\sum_j p_j}
  $$
* **When to use:** Probability-calibrated multiple-choice tasks.
* **When not:** Generation tasks.
* **LLM use:** **Evaluated LM only**.

---

### Probability

Unnormalized gold probability.

* **Input:** log-probs $\ell_i$, gold set.
* **Formulation:**

  $$
  \text{Score} = \max_{g\in\mathcal{G}} \exp(\ell_g)
  $$
* **When to use:** Compare model’s absolute confidence.
* **When not:** Cross-model comparisons, since models differ in calibration.
* **LLM use:** **Evaluated LM only**.
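
The three choice-scoring metrics above (loglikelihood accuracy, normalized multi-choice probability, and raw probability) can be sketched as follows. The function names and the example log-probs are made up for illustration.

```python
import math


def loglikelihood_acc(logprobs, gold_ixs) -> int:
    """1 if the highest-scoring choice is one of the gold indices."""
    best = max(range(len(logprobs)), key=lambda i: logprobs[i])
    return int(best in gold_ixs)


def normalized_mc_prob(logprobs, gold_ixs) -> float:
    """Best gold probability after renormalizing over the listed choices."""
    m = max(logprobs)
    probs = [math.exp(l - m) for l in logprobs]  # shift by the max for numerical stability
    total = sum(probs)
    return max(probs[g] / total for g in gold_ixs)


def gold_prob(logprobs, gold_ixs) -> float:
    """Best unnormalized gold probability."""
    return max(math.exp(logprobs[g]) for g in gold_ixs)


# Three choices with probabilities 0.1, 0.6, 0.1 (they need not sum to 1).
logprobs = [math.log(0.1), math.log(0.6), math.log(0.1)]
gold = {1}
print(loglikelihood_acc(logprobs, gold))             # 1
print(round(normalized_mc_prob(logprobs, gold), 4))  # 0.75 (0.6 / 0.8)
print(round(gold_prob(logprobs, gold), 4))           # 0.6
```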

---

### Recall\@k

Checks whether the top-$k$ choices contain a gold answer.

* **Input:** log-probs $\ell_i$, gold $\mathcal{G}$.
* **Formulation:**

  $$
  \text{Recall@}k = \mathbf{1}\!\left[\mathcal{G}\cap \text{TopK}(\ell,k)\neq\emptyset\right]
  $$
* **When to use:** Retrieval-style MCQ tasks.
* **When not:** Single-gold free-form tasks.
* **LLM use:** **Evaluated LM only**.

---

### MRR (Mean Reciprocal Rank)

Ranks gold among choices.

* **Input:** log-probs $\ell_i$, gold $\mathcal{G}$.
* **Formulation:**

  $$
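  r = \min_{g\in\mathcal{G}}\text{rank}(g),\quad \text{MRR} = \frac{1}{r+1}
  $$

  where $\text{rank}(g)$ is the 0-indexed position of $g$ when the choices are sorted by descending log-prob, so a gold ranked first scores 1.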
* **When to use:** Retrieval or MCQ tasks.
* **When not:** Generation.
* **LLM use:** **Evaluated LM only**.
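
Recall@k and MRR both start from the same ranking of choices by log-prob, as the sketch below shows. The helper names and the example values are assumptions for illustration, and ranks are 0-indexed.

```python
def ranked_indices(logprobs):
    """Choice indices sorted from highest to lowest log-prob."""
    return sorted(range(len(logprobs)), key=lambda i: logprobs[i], reverse=True)


def recall_at_k(logprobs, gold_ixs, k: int) -> int:
    """1 if any gold index appears among the k highest-scoring choices."""
    return int(any(i in gold_ixs for i in ranked_indices(logprobs)[:k]))


def mrr(logprobs, gold_ixs) -> float:
    """Reciprocal of (best 0-indexed gold rank + 1)."""
    order = ranked_indices(logprobs)
    best_rank = min(order.index(g) for g in gold_ixs)
    return 1.0 / (best_rank + 1)


logprobs = [-2.3, -0.4, -1.1, -3.0]  # ranking of choices: 1, 2, 0, 3
gold = {2}
print(recall_at_k(logprobs, gold, k=1))  # 0: the top choice is index 1
print(recall_at_k(logprobs, gold, k=2))  # 1: index 2 is ranked second
print(mrr(logprobs, gold))               # 0.5
```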

---

### AccGoldLikelihood

Checks token-level argmax vs. gold.

* **Input:** token logits, gold token sequence.
* **Formulation:**

  $$
  \text{Score} = \mathbf{1}[\exists\ \text{gold target with argmax(logits)=gold}]
  $$
* **When to use:** Strict token-level evaluation (e.g. code generation correctness).
* **When not:** Paraphrastic text.
* **LLM use:** **Evaluated LM only**.

---

# Sample-based metrics

These assume multiple samples $p^{(1)},\dots,p^{(n)}$ from the same prompt.

---

### Avg\@k

Average correctness across first $k$ samples.

* **Input:** scores $s_1,\dots,s_k$.
* **Formulation:**

  $$
  \text{Avg@}k = \frac{1}{k}\sum_{i=1}^k s_i
  $$
* **When to use:** Stability analysis, few-sample correctness.
* **When not:** If only one sample per prompt.
* **LLM use:** **Evaluated LM only**.

---

### Maj\@k

Majority-vote among samples.

* **Input:** normalized predictions $\nu(p^{(i)})$, gold $g$, and a per-sample scoring function $s$ (e.g. exact match).
* **Formulation:**

  $$
  \hat{y} = \arg\max_y \text{count}\{\nu(p^{(i)})=y\},\quad
  \text{Maj@}k = s(\hat{y},g)
  $$
* **When to use:** Multiple generations per query (MCQ/short answers).
* **When not:** Free-form summarization.
* **LLM use:** **Evaluated LM only**.
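
Avg@k and Maj@k differ only in how they aggregate the first $k$ samples: one averages the per-sample scores, the other scores the most frequent answer. The sketch below assumes exact match after a hypothetical strip-and-lowercase normalization as the scoring function, and $k \ge 1$.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Hypothetical normalization: trim whitespace and lowercase."""
    return text.strip().lower()


def avg_at_k(samples, gold: str, k: int) -> float:
    """Mean exact-match score over the first k samples."""
    scores = [int(normalize(s) == normalize(gold)) for s in samples[:k]]
    return sum(scores) / len(scores)


def maj_at_k(samples, gold: str, k: int) -> int:
    """Exact-match score of the most frequent normalized answer among the first k."""
    votes = Counter(normalize(s) for s in samples[:k])
    winner, _ = votes.most_common(1)[0]  # ties go to the answer seen first
    return int(winner == normalize(gold))


samples = ["42", " 42", "41", "42 ", "40"]
print(avg_at_k(samples, "42", k=5))  # 0.6: three of five samples are correct
print(maj_at_k(samples, "42", k=5))  # 1: "42" wins the vote
```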

---

### Pass\@k

Probability that at least one of $k$ samples solves the task (common in code eval).

* **Input:** $n$ samples, $c$ correct ones.
* **Formulation (Chen et al., 2021):**

  $$
  \text{Pass@}k = 
  \begin{cases}
  1 & n-c < k\\
  1 - \prod_{j=n-c+1}^n \left(1 - \frac{k}{j}\right) & \text{otherwise}
  \end{cases}
  $$
* **When to use:** Code generation, where multiple tries are allowed.
* **When not:** Open-ended natural language generation.
* **LLM use:** **Evaluated LM only**.
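
The product form above is a numerically stable way of computing $1 - \binom{n-c}{k}/\binom{n}{k}$. A minimal sketch, with a hypothetical function name:

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    prod = 1.0
    for j in range(n - c + 1, n + 1):
        prod *= 1.0 - k / j
    return 1.0 - prod


print(round(pass_at_k(n=10, c=3, k=1), 4))  # 0.3: with k = 1 this is just c / n
print(round(pass_at_k(n=10, c=3, k=5), 4))  # 0.9167: 1 - C(7,5) / C(10,5)
```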

---

### G-Pass\@k

Generalizes Pass\@k with success threshold.

* **Input:** successes $c$, total $n$, draw size $k$, threshold $t$.
* **Formulation:**

  $$
  X\sim \text{Hypergeom}(\text{population}=n,\ \text{successes}=c,\ \text{draws}=k),\quad
  \text{G-Pass@}k(t) = \Pr[X\ge \max(\lceil kt\rceil,1)]
  $$
* **When to use:** Stricter success definitions, multi-test settings.
* **When not:** Simple code-eval tasks.
* **LLM use:** **Evaluated LM only**.
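
The hypergeometric tail probability can be written out with `math.comb`, as in the sketch below. The function name is hypothetical; a library implementation might call a statistics package instead of summing the terms by hand.

```python
from math import ceil, comb


def g_pass_at_k(n: int, c: int, k: int, t: float) -> float:
    """P[X >= max(ceil(k*t), 1)] when drawing k of n samples, c of which are correct."""
    m = max(ceil(k * t), 1)  # minimum number of correct samples required
    total = comb(n, k)
    # Sum the hypergeometric pmf over x = m .. min(c, k); comb returns 0 for impossible draws.
    hits = sum(comb(c, x) * comb(n - c, k - x) for x in range(m, min(c, k) + 1))
    return hits / total


print(round(g_pass_at_k(n=10, c=3, k=5, t=0.0), 4))  # 0.9167: same as Pass@5
print(g_pass_at_k(n=10, c=3, k=5, t=1.0))            # 0.0: needs 5 correct, only 3 exist
```

With a threshold that requires only one correct sample, the value coincides with Pass@k; raising $t$ demands that a larger fraction of the $k$ draws be correct.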

---

### JudgeLLM (SimpleQA / MTBench / MixEval)

Uses a separate LLM to grade predictions.

* **Input:** question, prediction, (optional) gold/reference.
* **Formulation:**

  $$
  \text{Score} = f_{\text{judge-LLM}}(\text{question}, p, g)
  $$
* **When to use:** Open-ended generation (e.g. chat, reasoning) where automated metrics fail.
* **When not:** When budget constraints prevent calling another LLM.
* **LLM use:** **Yes — requires external judge LLM (OpenAI, LiteLLM, vLLM, etc.)**.


## Where are these metrics used?

The table below maps **tasks → common metrics → canonical benchmarks** for the most frequent categories of NLP evaluation.

---

# NLP Tasks, Metrics, and Benchmarks

| Task                                                        | Common Metrics                                                                             | Example Benchmarks                                    |
| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------ | ----------------------------------------------------- |
| **Question Answering (extractive / factual)**               | Exact Match (EM), F1 (bag-of-words), sometimes ROUGE                                       | SQuAD, NaturalQuestions, TriviaQA, TyDiQA             |
| **Open-domain QA (generative)**                             | BLEU, ROUGE-L, BERTScore, LLM-as-judge                                                     | MS MARCO, ELI5, TruthfulQA                            |
| **Reading Comprehension / Multiple-Choice QA**              | Loglikelihood Accuracy, Recall\@k, MRR, Normalized MC Probability                          | RACE, MMLU, ARC (Easy/Challenge)                      |
| **Summarization**                                           | ROUGE-1/2/L/Lsum, BERTScore, BLEURT, Extractiveness, Faithfulness (SummaCZS), LLM-as-judge | CNN/DailyMail, XSum, SAMSum, GovReport                |
| **Machine Translation**                                     | BLEU, chrF, COMET, BERTScore, BLEURT                                                       | WMT (e.g., WMT14 En-De, WMT21 tasks)                  |
| **Paraphrase Identification / Semantic Textual Similarity** | Pearson/Spearman corr. (STS score), BERTScore, cosine similarity                           | GLUE STS-B, MRPC, QQP                                 |
| **Text Classification**                                     | Accuracy, Precision, Recall, F1 (macro/micro)                                              | GLUE (SST-2, MNLI, QNLI), AG News, Yelp Reviews       |
| **Dialogue / Conversational Agents**                        | BLEU (historical), Distinct-n (diversity), ROUGE, LLM-as-judge                             | DailyDialog, MultiWOZ, MTBench                        |
| **Code Generation**                                         | Pass\@k, Exact Match, BLEU, CodeBLEU                                                       | HumanEval, MBPP, APPS                                 |
| **Reasoning / Math Word Problems**                          | Exact Match (final answer), EM\@k, LLM-as-judge                                            | GSM8K, AQuA-RAT, MATH, AIME24                         |
| **Toxicity / Bias / Safety**                                | Accuracy/F1 on toxic label, AUROC, calibration, safety probes                              | CivilComments, RealToxicityPrompts, BBQ, HolisticBias |
| **Information Retrieval / Ranking**                         | Recall\@k, NDCG, MRR                                                                       | MS MARCO, BEIR, TREC                                  |
| **Linguistic Acceptability**                                | Accuracy                                                                                   | CoLA (in GLUE), BLiMP                                 |
| **Natural Language Inference (NLI)**                        | Accuracy, Macro-F1                                                                         | MNLI, RTE, ANLI                                       |


### Using lighteval

Lighteval uses `Doc` ([API Docs](https://huggingface.co/docs/lighteval/en/package_reference/doc)) and `ModelResponse` ([API Docs](https://huggingface.co/docs/lighteval/en/package_reference/models_outputs)) objects to pass data to metrics.
The metrics themselves can be imported from `lighteval.metrics`.

In [None]:
# examples_lighteval.py
from lighteval.tasks.requests import Doc
from lighteval.models.model_output import ModelResponse
from lighteval.metrics.metrics import F1_score, BLEU, BLEURT, BertScore

# or call the metrics from here
from lighteval.metrics.metrics import Metrics
# Metrics.exact_match --> Uses Enum under the hood: https://github.com/huggingface/lighteval/blob/v0.11.0/src/lighteval/metrics/metrics.py



In [None]:
g = ["A quick brown fox jumps over the lazy dog."]
p = ["A fast brown fox leaps over a lazy dog."]

doc = Doc(
    task_name="similarity",
    choices=g, 
    query="", 
    gold_index=0  # index of the correct reference within `choices`
)
mr  = ModelResponse(text=p)

bleu_score = BLEU(n_gram=2).compute(doc=doc, model_response=mr)
f1_score = F1_score().compute(doc=doc, model_response=mr)
bleurt_score = BLEURT().compute(doc=doc, model_response=mr)

print("BLEU Score:", bleu_score)
print("F1 Score:", f1_score)
print("BLEURT Score:", bleurt_score)


[nltk_data] Downloading package punkt_tab to /Users/gupta/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


BLEU Score: 0.3333333333333333
F1 Score: 0.6666666666666666
