<a href="https://colab.research.google.com/github/patrickfleith/datapipes/blob/main/Evaluation_101.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting Started with LLM Evaluation Metrics

>[Exact Match](#scrollTo=VnMbUCEKATj7)

>[F1-score, precision, recall](#scrollTo=gV72VzkC3ime)

> More coming soon...


There are numerous ways we can evaluate text generated by LLMs.

> **In this notebook we assume we have reference text (gold labels / ground truth) against which we can compare LLM predictions**

We'll cover evaluation without references in another notebook.

We'll use the `evaluate` library from HuggingFace, which provides a wide range of metrics.

In [1]:
!pip install evaluate --quiet
# You can safely ignore the ERROR about dependency conflicts such as fsspec==2024.10.0.

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.9.0 which is incompatible

# Exact Match
This is a straightforward metric, although the results may surprise you.
We use the `evaluate` library from 🤗 HuggingFace.

With `evaluate`, you generally provide two lists:
- A list of **references**: the ground truth labels 🙏
- A list of **predictions**: the outputs generated by the LLM

In [2]:
from evaluate import load
exact_match_metric = load("exact_match")


## Exactly Exact
Here the predictions look close to the references, but there is **only 1 perfect match out of 4**

In [3]:
references = ["the cat", "theater", "YELLING", "agent007"] # the ground truth labels
predictions = ["cat", "theater", "yelling?", "agent"] # what's generated from your LLM

results = exact_match_metric.compute(
    references=references,
    predictions=predictions,
)

print(round(results["exact_match"],2))

0.25
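
To see where the 0.25 comes from, the comparison can be reproduced in plain Python. This is a minimal sketch of the idea (strict string equality, averaged over all pairs), not the `evaluate` implementation; the helper name `exact_match_score` is hypothetical.

```python
# Hypothetical helper: strict string equality, averaged over all pairs.
def exact_match_score(references, predictions):
    matches = [ref == pred for ref, pred in zip(references, predictions)]
    return sum(matches) / len(matches), matches

references = ["the cat", "theater", "YELLING", "agent007"]
predictions = ["cat", "theater", "yelling?", "agent"]

score, matches = exact_match_score(references, predictions)
print(matches)  # [False, True, False, False]
print(score)    # 0.25
```

Only `"theater"` is identical in both lists; a missing article, a case difference, a trailing `?` or missing digits each break the match.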


## Exactly Except
- `regexes_to_ignore`: Regex expressions of characters to ignore when calculating the exact matches. Note: these regexes are removed from the input data before the changes based on the options below (e.g. `ignore_case`, `ignore_punctuation`, `ignore_numbers`) are applied.

In [4]:
results = exact_match_metric.compute(
    references=references,
    predictions=predictions,
    regexes_to_ignore=["the "]
)

print(round(results["exact_match"],2))

0.5
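
The trailing space in the pattern `"the "` matters: it strips the article from `"the cat"` but leaves `"theater"` alone. A small sketch with Python's `re.sub`, assuming each pattern is simply deleted from both lists before the comparison, as the description above states:

```python
import re

references = ["the cat", "theater", "YELLING", "agent007"]
predictions = ["cat", "theater", "yelling?", "agent"]

pattern = "the "  # note the trailing space

# Delete every occurrence of the pattern from both lists
cleaned_refs = [re.sub(pattern, "", ref) for ref in references]
cleaned_preds = [re.sub(pattern, "", pred) for pred in predictions]

print(cleaned_refs)  # ['cat', 'theater', 'YELLING', 'agent007']

matches = [ref == pred for ref, pred in zip(cleaned_refs, cleaned_preds)]
print(sum(matches) / len(matches))  # 0.5
```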


## Quasi Exactly
You also have the following options to ignore:
- **`ignore_case`**: Boolean, defaults to False. If true, turns everything to lowercase so that capitalization differences are ignored.
- **`ignore_punctuation`**: Boolean, defaults to False. If true, removes all punctuation before comparing predictions and references.
- **`ignore_numbers`**: Boolean, defaults to False. If true, removes all digits before comparing predictions and references.

In [5]:
references = ["the cat", "theater", "YELLING", "agent007"]
predictions = ["cat", "theater", "yelling?", "agent"]

results = exact_match_metric.compute(
    references=references,
    predictions=predictions,
    regexes_to_ignore=["the "],
    ignore_case=True,
    ignore_punctuation=True,
    ignore_numbers=False
)

print(round(results["exact_match"],2))

0.75
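
These flags amount to normalizing both strings before comparing them. A rough sketch using only the standard library, assuming "punctuation" means the characters in `string.punctuation` and "numbers" means the digits 0-9; `normalize` is a hypothetical helper, not part of `evaluate`.

```python
import string

# Hypothetical helper mirroring the three boolean options described above.
def normalize(text, ignore_case=False, ignore_punctuation=False, ignore_numbers=False):
    if ignore_case:
        text = text.lower()
    if ignore_punctuation:
        # Delete every ASCII punctuation character
        text = text.translate(str.maketrans("", "", string.punctuation))
    if ignore_numbers:
        # Delete every digit 0-9
        text = text.translate(str.maketrans("", "", string.digits))
    return text

print(normalize("YELLING", ignore_case=True, ignore_punctuation=True))   # yelling
print(normalize("yelling?", ignore_case=True, ignore_punctuation=True))  # yelling
print(normalize("agent007", ignore_numbers=True))                        # agent
```

The last line suggests why `agent007` and `agent` still differ in the cell above: digits are only dropped when `ignore_numbers=True`.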


## Example with full sentences
In practice, you will rarely use exact match to compare individual words; you will compare full sentences or completions.

So here is an example with 2 full sentences.

In [6]:
from evaluate import load
exact_match_metric = load("exact_match")

references = [
    "I like to eat chocolate with my coffee 😀",
    "Tomorrow, I'll graduate!! So excited"
]

predictions = [
    "I like chocolate with coffee",
    "Tomorrow, I'll graduate! So excited"
]

results = exact_match_metric.compute(
    references=references,
    predictions=predictions,
    ignore_case=True,
    ignore_punctuation=True,
    ignore_numbers=True
)

print(round(results["exact_match"],2))

0.5


# F1-score, precision, recall
You might already be familiar with these metrics from machine learning classification tasks. If not, don't worry.

- **F1-score**: A single number that balances the ability to detect the positive cases (recall) against the ability to avoid false detections (precision).

- **Precision**: Out of all the answers the model gave, how many were actually correct? (Avoiding false alarms or incorrect guesses.)

- **Recall**: Out of all the actually correct answers possible, how many did the model find? (Avoiding missing correct answers.)

These metrics are useful in some scenarios, for instance when we use an LLM as a classifier (although I wouldn't recommend doing that; just use `SetFit` instead).

For instance, let's assume you have a dataset of prompts of the form
`True or False: statement?` where you expect the LLM to answer either True or False. That reduces the evaluation to a classification problem, and we can use precision, recall and F1-score.

**What is the F1 score?**
The F1 is the harmonic mean of the precision and recall. It can be computed with the equation:

$$
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$
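
Because it is a harmonic mean, the F1 is pulled towards the lower of the two values. A short sketch of the formula with made-up numbers (the precision and recall values below are hypothetical, and `f1_from_precision_recall` is an illustrative helper):

```python
# Illustrative helper implementing the formula above.
def f1_from_precision_recall(precision, recall):
    # Return 0 when both inputs are 0 to avoid dividing by zero
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

precision, recall = 0.5, 1.0  # hypothetical values
f1 = f1_from_precision_recall(precision, recall)

print(round(f1, 2))              # 0.67 (harmonic mean)
print((precision + recall) / 2)  # 0.75 (arithmetic mean, for comparison)
```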



In [42]:
# Let's define a dummy model that randomly predicts True or False
import numpy as np

np.random.seed(43)

def llm(fact_list) -> list[bool]:
    # One random guess per statement; the statement text itself is ignored
    predictions = [bool(np.random.choice([True, False])) for _ in fact_list]
    return predictions

In [46]:
true_or_false_statements = [
    "The Earth is a perfect sphere",
    "Thomas Pesquet is the first French astronaut on the Moon",
    "If you could jump straight up more than 100 km, you would fall back down to the surface",
    "Mercury is closer to the Earth than Uranus"
]

references = [False, False, True, True]
predictions = llm(fact_list=true_or_false_statements)
predictions

[False, True, True, False]

In [47]:
from evaluate import load
f1_metric = load("f1")

# turn booleans into 0/1
references = [int(item) for item in references]
predictions = [int(pred) for pred in predictions]

results = f1_metric.compute(
    references=references,
    predictions=predictions,
    # sample_weight=[2, 1, 3, 1]
)

print(round(results['f1'], 2))

0.5


In [48]:
predictions

[0, 1, 1, 0]

In [49]:
references

[0, 0, 1, 1]
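
The 0.5 can be checked by hand from these two lists. The sketch below assumes class 1 (True) is the positive class and simply counts true positives, false positives and false negatives; it is not how `evaluate` computes the score internally.

```python
references = [0, 0, 1, 1]
predictions = [0, 1, 1, 0]

pairs = list(zip(references, predictions))
tp = sum(1 for ref, pred in pairs if ref == 1 and pred == 1)  # true positives
fp = sum(1 for ref, pred in pairs if ref == 0 and pred == 1)  # false positives
fn = sum(1 for ref, pred in pairs if ref == 1 and pred == 0)  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn)             # 1 1 1
print(precision, recall, f1)  # 0.5 0.5 0.5
```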

**You could also do multi-class**
- That is especially useful for multiple choice question answering (with one correct answer out of several).
- In that case you have a few options for averaging over the classes (as in sklearn's `f1_score`): `macro`, `micro` or `weighted`.

In [8]:
# here we have 3 classes: 0, 1, 2
predictions = [0, 2, 1, 0, 0, 1]
references = [0, 1, 2, 0, 1, 2]

results = f1_metric.compute(predictions=predictions, references=references, average="macro")
print(round(results['f1'], 2))
results = f1_metric.compute(predictions=predictions, references=references, average="micro")
print(round(results['f1'], 2))
results = f1_metric.compute(predictions=predictions, references=references, average="weighted")
print(round(results['f1'], 2))

0.27
0.33
0.27


If `average` is set to `None`, the scores for each class are returned.

In [9]:
results = f1_metric.compute(predictions=predictions, references=references, average=None)
print(results)

{'f1': array([0.8, 0. , 0. ])}
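
The three averages printed earlier can be recovered from these per-class scores. A sketch under two assumptions: the weights for `weighted` are the number of reference items per class, and, for single-label multi-class data like this, the `micro` F1 equals plain accuracy.

```python
from collections import Counter

predictions = [0, 2, 1, 0, 0, 1]
references = [0, 1, 2, 0, 1, 2]
per_class_f1 = [0.8, 0.0, 0.0]  # copied from the output above

support = Counter(references)  # number of reference items per class

# macro: unweighted mean of the per-class scores
macro = sum(per_class_f1) / len(per_class_f1)
# weighted: mean of the per-class scores, weighted by class support
weighted = sum(per_class_f1[c] * support[c] for c in support) / len(references)
# micro: computed over all items at once (here, the fraction of correct predictions)
micro = sum(ref == pred for ref, pred in zip(references, predictions)) / len(references)

print(round(macro, 2), round(micro, 2), round(weighted, 2))  # 0.27 0.33 0.27
```

Macro and weighted coincide here only because every class has the same number of reference items.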


In [10]:
recall_metric = load("recall")
precision_metric = load("precision")


In [11]:
recall_metric.compute(
    references=[0, 1, 0, 1, 0, 1, 0],
    predictions=[0, 0, 1, 1, 0, 1, 1],
)

{'recall': 0.6666666666666666}

In [12]:
precision_metric.compute(
    references=[0, 1, 0, 1, 0],
    predictions=[0, 0, 1, 1, 0],
)

{'precision': 0.5}