<a href="https://colab.research.google.com/github/patrickfleith/datapipes/blob/main/Evaluation_101.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting Started with LLM Evaluation Metrics

>[Exact Match](#scrollTo=VnMbUCEKATj7)

>[F1-score, precision, recall](#scrollTo=gV72VzkC3ime)

>[Text Normalization](#scrollTo=aOgVWdvTlcOH)

>[Damerau-Levenshtein Distance](#scrollTo=zklowmjTn-lF)

>[Embedding Distance](#scrollTo=pHZXxBPcs0Q4)



There are numerous ways we can evaluate text generated by LLMs.

> **In this notebook we assume we have reference text (gold labels / ground truth) against which we can compare LLM predictions**

We'll cover evaluation without references in another notebook.

We'll use the `evaluate` library from HuggingFace, which provides a range of ready-made metrics.

In [1]:
!pip install evaluate --quiet
# You can safely ignore ERROR related to requirements to fsspec==2024.10.0 etc.


# Exact Match
This is a straightforward metric, although the results may surprise you.
We use the `evaluate` library from 🤗 HuggingFace.

With `evaluate`, computing a metric generally takes two inputs:
- A list of **references**: the ground truth labels 🙏
- A list of **predictions**: the labels produced by the LLM

In [2]:
from evaluate import load
exact_match_metric = load("exact_match")


## Exactly Exact
Here every prediction is close to its reference, but there is **only 1 perfect match out of 4**.

In [3]:
references = ["the cat", "theater", "YELLING", "agent007"] # the ground truth labels
predictions = ["cat", "theater", "yelling?", "agent"] # what's generated from your LLM

results = exact_match_metric.compute(
    references=references,
    predictions=predictions,
)

print(round(results["exact_match"],2))

0.25
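To make the score concrete, here is a minimal pure-Python sketch of what an exact-match score boils down to: the share of pairs where the prediction and the reference are identical strings. The helper name `simple_exact_match` is hypothetical and is not part of the `evaluate` library.

```python
def simple_exact_match(references, predictions):
    """Fraction of (reference, prediction) pairs that are identical strings."""
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")
    matches = sum(ref == pred for ref, pred in zip(references, predictions))
    return matches / len(references)


references = ["the cat", "theater", "YELLING", "agent007"]
predictions = ["cat", "theater", "yelling?", "agent"]

# Only "theater" is identical on both sides, so 1 pair out of 4 matches.
print(simple_exact_match(references, predictions))  # 0.25
```

Any difference in case, punctuation, or an extra word counts as a miss, which is why the score is so low here.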


## Exactly Except
- `regexes_to_ignore`: List of regular expressions; any text they match is removed before exact matches are calculated. Note: these regexes are applied to the input data before the options below (e.g. `ignore_case`, `ignore_punctuation`, `ignore_numbers`).

In [4]:
results = exact_match_metric.compute(
    references=references,
    predictions=predictions,
    regexes_to_ignore=["the "]
)

print(round(results["exact_match"],2))

0.5


## Quasi Exactly
You can also set the following options:
- **`ignore_case`**: Boolean, defaults to False. If true, turns everything to lowercase so that capitalization differences are ignored.
- **`ignore_punctuation`**: Boolean, defaults to False. If true, removes all punctuation before comparing predictions and references.
- **`ignore_numbers`**: Boolean, defaults to False. If true, removes all numbers before comparing predictions and references.
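As a rough illustration of how these options interact, the sketch below applies them in pure Python: regex removal first, then lowercasing, punctuation stripping, and digit stripping. The names `preprocess` and `relaxed_exact_match` are hypothetical, and this is an assumption about the general idea, not the `evaluate` implementation.

```python
import re
import string


def preprocess(text, regexes_to_ignore=(), ignore_case=False,
               ignore_punctuation=False, ignore_numbers=False):
    # Regex matches are removed first, before the other options are applied.
    for pattern in regexes_to_ignore:
        text = re.sub(pattern, "", text)
    if ignore_case:
        text = text.lower()
    if ignore_punctuation:
        text = text.translate(str.maketrans("", "", string.punctuation))
    if ignore_numbers:
        text = text.translate(str.maketrans("", "", string.digits))
    return text


def relaxed_exact_match(references, predictions, **options):
    # Apply the same preprocessing to both sides, then count identical pairs.
    hits = sum(
        preprocess(ref, **options) == preprocess(pred, **options)
        for ref, pred in zip(references, predictions)
    )
    return hits / len(references)


references = ["the cat", "theater", "YELLING", "agent007"]
predictions = ["cat", "theater", "yelling?", "agent"]

score = relaxed_exact_match(
    references, predictions,
    regexes_to_ignore=["the "], ignore_case=True, ignore_punctuation=True,
)
print(score)  # 0.75

score_without_digits = relaxed_exact_match(
    references, predictions,
    regexes_to_ignore=["the "], ignore_case=True,
    ignore_punctuation=True, ignore_numbers=True,
)
print(score_without_digits)  # 1.0
```

Note that the regex `"the "` includes a trailing space, so it strips the article from "the cat" but leaves "theater" untouched.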

In [5]:
references = ["the cat", "theater", "YELLING", "agent007"]
predictions = ["cat", "theater", "yelling?", "agent"]

results = exact_match_metric.compute(
    references=references,
    predictions=predictions,
    regexes_to_ignore=["the "],
    ignore_case=True,
    ignore_punctuation=True,
    ignore_numbers=False
)

print(round(results["exact_match"],2))

0.75


## Example with full sentences
In practice, you'll probably use exact match to compare full sentences or completions rather than individual words.

So here is an example with 2 full sentences.

In [6]:
from evaluate import load
exact_match_metric = load("exact_match")

references = [
    "I like to eat chocolate with my coffee 😀",
    "Tomorrow, I'll graduate!! So excited"
]

predictions = [
    "I like chocolate with coffee",
    "Tomorrow, I'll graduate! So excited"
]

results = exact_match_metric.compute(
    references=references,
    predictions=predictions,
    ignore_case=True,
    ignore_punctuation=True,
    ignore_numbers=True
)

print(round(results["exact_match"],2))

0.5


# F1-score, precision, recall
You might be already familiar with those metrics from machine learning classification tasks. If not, don't worry.

- **F1-score**: A balance between the ability to properly detect something (recall) and the ability to avoid false detections (precision).

- **Precision**: Out of all the answers the model gave, how many were actually correct? (Avoiding false alarms or incorrect guesses.)

- **Recall**: Out of all the actually correct answers possible, how many did the model find? (Avoiding missing correct answers.)

There are scenarios in which these metrics are useful, for instance when we use an LLM as a classifier (although I wouldn't recommend doing that; just use `SetFit` instead).

For instance, let's assume you have a dataset of `True or False: statement?` prompts, where you expect the LLM to answer either True or False. That reduces the evaluation to a classification problem, and we can use precision, recall and F1-score.

**What is the F1 score?**
The F1 is the harmonic mean of the precision and recall. It can be computed with the equation:

$$
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$
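The definitions above can be sketched in a few lines of plain Python by counting true positives, false positives, and false negatives for the positive class. The function name `precision_recall_f1`, the toy labels, and the choice to return 0.0 when a denominator is zero are assumptions made for this illustration.

```python
def precision_recall_f1(references, predictions, positive=1):
    """Binary precision, recall and F1 for the given positive label."""
    pairs = list(zip(references, predictions))
    tp = sum(r == positive and p == positive for r, p in pairs)  # correct detections
    fp = sum(r != positive and p == positive for r, p in pairs)  # false alarms
    fn = sum(r == positive and p != positive for r, p in pairs)  # missed positives

    # Assumption: fall back to 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


references = [1, 1, 0, 1, 0]
predictions = [1, 0, 0, 1, 1]

precision, recall, f1 = precision_recall_f1(references, predictions)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.67 0.67 0.67
```

Here there are 2 true positives, 1 false positive, and 1 false negative, so precision and recall are both 2/3 and their harmonic mean is 2/3 as well.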



In [7]:
# let's define a dummy model that randomly predicts True or False
import numpy as np

np.random.seed(43)

def llm(fact_list) -> list[bool]:
    # one random guess per statement; the statement text itself is ignored
    predictions = [np.random.choice([True, False]) for _ in fact_list]
    return predictions

In [8]:
true_or_false_statements = [
    "The Earth is a perfect sphere",
    "Thomas Pesquet is the first French astronaut on the Moon",
    "If you could jump straight up more than 100 km, you would fall back down to the surface",
    "Mercury is closer to the Earth than Uranus"
]

references = [False, False, True, True]
predictions = llm(fact_list=true_or_false_statements)
predictions

[True, True, False, False]

In [9]:
from evaluate import load
f1_metric = load("f1")

# turn booleans into 0/1
references = [int(item) for item in references]
predictions = [int(pred) for pred in predictions]

results = f1_metric.compute(
    references=references,
    predictions=predictions,
    # sample_weight=[2, 1, 3, 1]
)

print(round(results['f1'], 2))

0.0


In [10]:
predictions

[1, 1, 0, 0]

In [11]:
references

[0, 0, 1, 1]

**You can also handle the multi-class case**
- That is especially useful for multiple choice question answering (with one correct answer out of several).
- In that case you have a few options for averaging over the classes (as in sklearn's `f1_score`): `macro`, `micro` or `weighted`.

In [12]:
# here we have 3 classes: 0, 1, 2
predictions = [0, 2, 1, 0, 0, 1]
references = [0, 1, 2, 0, 1, 2]

results = f1_metric.compute(predictions=predictions, references=references, average="macro")
print(round(results['f1'], 2))
results = f1_metric.compute(predictions=predictions, references=references, average="micro")
print(round(results['f1'], 2))
results = f1_metric.compute(predictions=predictions, references=references, average="weighted")
print(round(results['f1'], 2))

0.27
0.33
0.27


If `average` is set to `None`, the scores for each class are returned.

In [13]:
results = f1_metric.compute(predictions=predictions, references=references, average=None)
print(results)

{'f1': array([0.8, 0. , 0. ])}
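To see where the three averaged numbers come from, the sketch below recomputes them from per-class F1 scores in plain Python. The helper `per_class_f1` is hypothetical; the data is the same 3-class example used in this section.

```python
from collections import Counter


def per_class_f1(references, predictions):
    """F1 per label, using F1 = 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(references, predictions))
    scores = {}
    for label in sorted(set(references) | set(predictions)):
        tp = sum(r == label and p == label for r, p in pairs)
        fp = sum(r != label and p == label for r, p in pairs)
        fn = sum(r == label and p != label for r, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


predictions = [0, 2, 1, 0, 0, 1]
references = [0, 1, 2, 0, 1, 2]

scores = per_class_f1(references, predictions)
support = Counter(references)  # number of true examples per class

# macro: plain mean of the per-class scores
macro = sum(scores.values()) / len(scores)
# weighted: mean of the per-class scores, weighted by class support
weighted = sum(scores[c] * support[c] for c in scores) / len(references)
# micro: pools TP/FP/FN over all classes; for single-label data this equals accuracy
micro = sum(r == p for r, p in zip(references, predictions)) / len(references)

print(round(macro, 2), round(micro, 2), round(weighted, 2))  # 0.27 0.33 0.27
```

Macro and weighted coincide here only because every class has the same support (2 examples each).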


In [14]:
recall_metric = load("recall")
precision_metric = load("precision")

In [15]:
recall_metric.compute(
    references=[0, 1, 0, 1, 0, 1, 0],
    predictions=[0, 0, 1, 1, 0, 1, 1],
)

{'recall': 0.6666666666666666}

In [16]:
precision_metric.compute(
    references=[0, 1, 0, 1, 0],
    predictions=[0, 0, 1, 1, 0],
)

{'precision': 0.5}

# Text Normalization

Text normalization converts text into a standard format with reduced variability.

It is not a single operation, but a collection of small transformations that you can choose to apply or not to your text.

**Examples** 👇️
- Lowercasing text: "Hello" → "hello"
- Removing punctuation: "Hello, world!" → "Hello world"
- Removing stopwords: eliminate common words like "a", "an", and "the" that don't add much meaning in some contexts.
- Removing extra spaces and normalizing them to a single space: " Hello world " → "Hello world"
- Reducing words to their base or root forms: "running" → "run"
- Converting numbers to words (e.g., "1" to "one") and expanding abbreviations (e.g., "Dr." to "Doctor")
- etc.

We normalize to avoid penalizing irrelevant variations:
- normalize your reference text
- normalize your prediction

Then compare the two for evaluation.

But be careful! Text normalization strategies can vary based on the problem you're solving. Sometimes you might want to keep punctuation, acronyms, or something else!



In [17]:
import re
import string

def normalize_text(s):
    """
    Normalize a text string by applying several transformations:
    1. Convert all characters to lowercase.
    2. Remove punctuation marks.
    3. Remove articles ("a", "an", "the").
    4. Remove extra whitespace.
    """
    ARTICLES_REGEX = re.compile(r"\b(a|an|the)\b", re.UNICODE)

    def remove_articles(text):
        return ARTICLES_REGEX.sub(" ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))

In [18]:
normalize_text("The Quick, brown fox jumped over the Lazy DOG!!")

'quick brown fox jumped over lazy dog'
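As a small illustration of the "normalize both sides, then compare" workflow, the sketch below uses a compact, self-contained helper. The name `simple_normalize` and the example sentences are made up for this illustration; the helper applies the same four steps (lowercase, punctuation, articles, whitespace) in condensed form.

```python
import re
import string


def simple_normalize(text):
    # lowercase, drop punctuation, drop articles, collapse whitespace
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


reference = "The cat sat on a mat."
prediction = "cat sat on the mat"

# The raw strings differ in case, punctuation and articles.
print(reference == prediction)  # False

# After normalizing both sides, the remaining content is identical.
print(simple_normalize(reference) == simple_normalize(prediction))  # True
```

Whether dropping articles is acceptable depends on the task: here "a mat" and "the mat" are treated as equivalent, which may or may not be what you want.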

# Damerau-Levenshtein Distance

It measures the **minimum number of single-character edits required to transform one string into another**.

Available operations of the Levenshtein distance:
- character **insertion**
- character **deletion**
- character **substitution**

With Damerau we have one more operation:
- swapping two adjacent characters (**transposition**)
    - this captures typos where letters are swapped (e.g., "adn" → "and")
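The four operations can be sketched with a small dynamic-programming table. The code below implements the restricted "optimal string alignment" variant of the distance, which is an assumption chosen for brevity and is not necessarily the exact algorithm used by any particular package. The function names `osa_distance` and `normalized_osa_distance` are hypothetical, and dividing by the longer string's length is one common normalization, assumed here.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance with insertion, deletion, substitution and adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and the first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def normalized_osa_distance(a: str, b: str) -> float:
    # Assumption: normalize by the length of the longer string.
    longest = max(len(a), len(b))
    return osa_distance(a, b) / longest if longest else 0.0


print(osa_distance("adn", "and"))         # 1 (one transposition)
print(osa_distance("kitten", "sitting"))  # 3
```

Without the transposition step, "adn" → "and" would cost 2 (two substitutions), which is the plain Levenshtein distance.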

Why / when it is good 🤗
- **simple and intuitive**
- great for near matches and minor variations
- **great for short-answer tasks**
- better when precise words are expected
- language agnostic
- computationally efficient

Why / when it is not good ❌
- **semantic similarity is ignored**: "Car" and "Automobile" would be considered very different.
- **struggles with longer text**
- negation can kill it: "I will come with you" is the opposite of "I will not come with you", but the two will still be fairly similar under DL distance.

In [19]:
!pip install pyxDamerauLevenshtein --quiet


In [20]:
from pyxdameraulevenshtein import damerau_levenshtein_distance
from pyxdameraulevenshtein import normalized_damerau_levenshtein_distance

In [21]:
references = [
    "I love chocolate, with my coffee in the morning!!",
    "hey, the cat is here!"
]

predictions = [
    "I eat chocolate with coffee every morning",
    "hey, your cat is here!"
]

In [22]:
for r, p in zip(references, predictions):
    print("----- w/o text normalization\n")
    print(f"regular: \t{damerau_levenshtein_distance(r, p)}")
    print(f"normalized: \t{round(normalized_damerau_levenshtein_distance(r, p),3)}\n")

    print("----- with text normalization\n")
    # use names that do not shadow numpy's conventional `np` alias
    norm_ref, norm_pred = normalize_text(r), normalize_text(p)
    print(f"regular: \t{damerau_levenshtein_distance(norm_ref, norm_pred)}")
    print(f"normalized: \t{round(normalized_damerau_levenshtein_distance(norm_ref, norm_pred),3)}\n\n")

----- w/o text normalization

regular: 	16
normalized: 	0.327

----- with text normalization

regular: 	12
normalized: 	0.286


----- w/o text normalization

regular: 	4
normalized: 	0.182

----- with text normalization

regular: 	5
normalized: 	0.25




# Embedding Distance

In [92]:
from sentence_transformers import SentenceTransformer
import numpy as np

def calculate_similarity_score(references, predictions):
    # Load a pretrained Sentence Transformer model
    model = SentenceTransformer("all-MiniLM-L6-v2")

    similarity_score = {"overall": 0.0, "scores": []}

    for refs, preds in zip(references, predictions):
        # Encode all references and predictions in batches
        ref_embeddings = model.encode(refs, convert_to_tensor=True)
        pred_embeddings = model.encode(preds, convert_to_tensor=True)

        # Calculate the similarity matrix
        similarity_matrix = model.similarity(ref_embeddings, pred_embeddings)

        # Keep the best similarity over all (reference, prediction) pairs for this example
        max_sim = similarity_matrix.max(dim=1).values.max().item()
        similarity_score["scores"].append(max_sim)

    # Calculate the overall mean score
    similarity_score["overall"] = np.mean(similarity_score["scores"])

    return similarity_score

# Example usage
references = [
    ["2024.", "two thousand twenty-four"],
    ["Hello"],
    ["I like you"]
]
predictions = [
    ["Year 2024"],
    ["Hi"],
    ["Planet Earth is big"]
]

similarity_score = calculate_similarity_score(references, predictions)
print(similarity_score)

{'overall': 0.6170853773752848, 'scores': [0.8899925947189331, 0.8071528673171997, 0.15411067008972168]}
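For intuition about the scores above, here is a minimal pure-Python sketch of cosine similarity, which, to my understanding, is the default similarity function behind `model.similarity` in recent `sentence-transformers` releases (treat that as an assumption). The toy vectors are made-up numbers, not real model embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 3-dimensional "embeddings" (illustrative values only)
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])

print(round(same_direction, 2), round(orthogonal, 2))  # 1.0 0.0
```

Vectors pointing in the same direction score close to 1 regardless of their length, while unrelated (orthogonal) vectors score close to 0, which matches the pattern of a high score for "Hello" / "Hi" and a low score for the unrelated sentence pair.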
