---
title: "Evaluations"
jupyter: python3
---

Every extraction task needs a reliable way to evaluate whether the extracted data is correct and to score how correct it is. The goal is to quantify the performance of the extraction pipeline (the model). Partial scores, usually between 0 and 1, show how close each data point is to the ground truth, so the pipeline can be improved by inspecting the low-scoring data points and fixing the edge cases or errors they reveal.

### Imports
We import the libraries used throughout this chapter here; `thefuzz` and `munkres` are imported later, in the matching section.

In [60]:
import json
from statistics import mean
from pint import UnitRegistry

### Simple example data

In [61]:
truth = {
    "text": "result",
    "correct": "correct",
    "number": 0.45,
    "wrong": 1.023,
    "bool": False,
    "missing": 0,
}
prediction = {
    "text": "result",
    "correct": "incorrect",
    "number": 0.45,
    "wrong": 1.025,
    "bull": False,
}

### Scoring a Key-Value pair

To properly evaluate a structured output from a model, we have to walk through the structure of the ground truth and the output simultaneously. We check every value we expected the model to return and count how many of them are correct.
With the count of correct values, we can calculate precision and recall:

$$
\text{precision} = \frac{|\{\text{relevant entries}\} \cap \{\text{retrieved entries}\}|}{|\{\text{retrieved entries}\}|}
$$

$$
\text{recall} = \frac{|\{\text{relevant entries}\} \cap \{\text{retrieved entries}\}|}{|\{\text{relevant entries}\}|}
$$

In [75]:
relevant_entries_intersection_retrieved_entries = []
for t_key in truth:
    if t_key in prediction:
        if truth[t_key] == prediction[t_key]:
            relevant_entries_intersection_retrieved_entries.append(t_key)
recall = len(relevant_entries_intersection_retrieved_entries) / len(truth.keys())
precision = len(relevant_entries_intersection_retrieved_entries) / len(
    prediction.keys()
)
print("Recall: ", recall, "Precision: ", precision)

Recall:  0.3333333333333333 Precision:  0.4
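
The same calculation can be wrapped in a small helper. The sketch below is one possible way to do it, assuming flat dictionaries compared by exact value match. The name `precision_recall` and the example dictionaries are hypothetical, and returning 0.0 for an empty dictionary is an assumption made to avoid a division by zero.

```python
def precision_recall(truth, prediction):
    """Return (precision, recall) for two flat dicts compared by exact value match."""
    # Keys that appear in both dicts with identical values
    correct = [
        key for key in truth if key in prediction and truth[key] == prediction[key]
    ]
    # Assumption: an empty dict scores 0.0 instead of raising ZeroDivisionError
    precision = len(correct) / len(prediction) if prediction else 0.0
    recall = len(correct) / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical data: only "a" is both expected and predicted correctly
example_truth = {"a": 1, "b": 2, "c": 3, "d": 4}
example_prediction = {"a": 1, "b": 20, "e": 5}
print(precision_recall(example_truth, example_prediction))
```

Here one of the three predicted entries is correct (precision 1/3) and one of the four expected entries was found (recall 1/4).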


## Evaluation of results from the *Choosing a learning paradigm* chapter

### Loading the results
We use the responses from GPT-4o and read the JSON file that contains both the predicted structured outputs and the expected reference ground truth.

In [63]:
with open("../finetune/OpenAI_results.json", "r") as f:
    open_ai_results = json.load(f)
    truths = open_ai_results["1-shot"]["references"]
    predictions = open_ai_results["1-shot"]["predictions"]
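    # Convert each JSON string to a dict. For the predictions, keep only the text
    # between the first "{" and the last "}" in case the model added text around it.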
    truths = [json.loads(truth) for truth in truths]
    predictions = [
        json.loads(prediction[prediction.index("{") : prediction.rindex("}") + 1])
        for prediction in predictions
    ]
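
The slicing above relies on `str.index` and `str.rindex`, which raise a `ValueError` when a response contains no braces, and `json.loads` raises on malformed JSON. A more forgiving variant could look like the sketch below. The helper name `extract_json` and the choice to return `None` for unusable responses are assumptions for illustration, not part of the original pipeline.

```python
import json


def extract_json(text):
    """Return the dict encoded between the first '{' and the last '}', or None."""
    start = text.find("{")
    end = text.rfind("}")
    # str.find / str.rfind return -1 when the character is absent
    if start == -1 or end < start:
        return None
    try:
        return json.loads(text[start : end + 1])
    except json.JSONDecodeError:
        # Assumption: malformed JSON is treated as a missing prediction
        return None


print(extract_json('Here is the result: {"yield": 0.9} Hope this helps!'))
print(extract_json("I could not find any data."))
```

Responses that come back as `None` still have to be handled when scoring, for example by counting them as zero correct items.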

### Defining common functions for later use

In [64]:
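def common_items(dict1, dict2):
    """Yield (dotted_path, value) for every leaf that is equal in both dicts.

    Lists are compared as the cross product of their dict items;
    list items that are not dicts are skipped.
    """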
    for key in dict1:
        if key in dict2:
            if isinstance(dict1[key], dict) and isinstance(dict2[key], dict):
                for subkey, subvalue in common_items(dict1[key], dict2[key]):
                    yield f"{key}.{subkey}", subvalue
            elif isinstance(dict1[key], list) and isinstance(dict2[key], list):
                for item1 in dict1[key]:
                    for item2 in dict2[key]:
                        if isinstance(item1, dict) and isinstance(item2, dict):
                            for subkey, subvalue in common_items(item1, item2):
                                yield f"{key}.[].{subkey}", subvalue
            elif dict1[key] == dict2[key]:
                yield key, dict1[key]


def count_correct_items(dict1, dict2):
    """Count the leaves that common_items finds equal in both dicts."""
    return sum(1 for _ in common_items(dict1, dict2))


def count_leaf_keys(d):
    """Count the leaf (non-dict, non-list) values in a nested structure."""
    if isinstance(d, dict):
        count = 0
        for value in d.values():
            count += count_leaf_keys(value)
        return count
    elif isinstance(d, list):
        count = 0
        for item in d:
            count += count_leaf_keys(item)
        return count
    else:
        return 1
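
To see what counts as a leaf, it can help to list the leaves explicitly. The sketch below is a hypothetical helper, `leaf_paths`, that walks a nested structure in the same way and yields one dotted path per leaf, using the same `[]` notation for list items. The example record is made up for illustration.

```python
def leaf_paths(obj, prefix=""):
    """Yield (path, value) for every leaf in nested dicts and lists."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from leaf_paths(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for item in obj:
            yield from leaf_paths(item, f"{prefix}.[]" if prefix else "[]")
    else:
        # Anything that is neither a dict nor a list is a leaf
        yield prefix, obj


# Hypothetical record with one nested dict and one list of dicts
example = {
    "temperature": {"value": 25, "unit": "C"},
    "solvents": [{"name": "water"}, {"name": "ethanol"}],
}
for path, value in leaf_paths(example):
    print(path, "=", value)
```

This record has four leaves, which is the denominator that would be used for recall or precision on it.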

### Evaluation of the results

In [65]:
recalls = []
precisions = []
for i, truth in enumerate(truths):
    prediction = predictions[i]
    num_correct_items = count_correct_items(truth, prediction)
    recalls.append(num_correct_items / count_leaf_keys(truth))
    precisions.append(num_correct_items / count_leaf_keys(prediction))

print(mean(recalls), mean(precisions))

0.29688510133403073 0.2954236615993505


::: {.column-margin}
The variable `truth` is the manually curated reference extraction and `prediction` is what the LLM returns through the pipeline.
:::

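#### F1 score

The F1 score is the harmonic mean of precision and recall. Here it is computed from the `recall` and `precision` values of the simple example above.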

In [66]:
f1_score = 2 / ((1 / recall) + (1 / precision))
print("F1 score: ", f1_score)

F1 score:  0.36363636363636365
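
The expression `2 / ((1 / recall) + (1 / precision))` raises a `ZeroDivisionError` as soon as either score is 0. A minimal sketch of a guarded version follows, using the algebraically equivalent form 2PR / (P + R). The name `f1_score_safe` and the convention of returning 0.0 when both scores are 0 are assumptions.

```python
def f1_score_safe(precision, recall):
    """Harmonic mean of precision and recall, defined as 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(f1_score_safe(0.5, 0.5))
print(f1_score_safe(0.0, 0.0))
```

If only one of the two scores is 0, the numerator is 0 and the function returns 0.0 without a special case.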


### Matching to ground truth

First, let's re-run the evaluation on just the **inputs** section of our schema. We will use it as an example of how fuzzy matching can improve our scores. Sometimes the model misinterprets part of the structure but still extracts values nested deep inside it correctly. Matching pairs predicted entries with ground-truth entries by the similarity of their keys, so the correct values inside a misinterpreted section can still be scored.

We run the same evaluation routine, but instead of looking for correct keys in the whole structure we limit it to the **inputs** section.

Comparing the two blocks of code below, you can see the improvement in the evaluation scores. Matching can give you a better understanding of where and how your model's predictions need improvement.

In [70]:
from thefuzz import fuzz
from munkres import Munkres

m = Munkres()

recalls = []
precisions = []
for i, truth in enumerate(truths):
    prediction = predictions[i]
    num_correct_items = count_correct_items(
        truth["inputs"], prediction["inputs"]
    )  # Here we select just the "inputs"
    recalls.append(num_correct_items / count_leaf_keys(truth["inputs"]))
    precisions.append(num_correct_items / count_leaf_keys(prediction["inputs"]))

print(mean(recalls), mean(precisions))

0.1282237586783771 0.11026275208112904


In [71]:
from munkres import Munkres

m = Munkres()

recalls = []
precisions = []
for i, truth in enumerate(truths):
    prediction = predictions[i]
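    # Similarity (0-100) between every predicted key (rows) and true key (columns)
    scores = [
        [fuzz.token_sort_ratio(t, p) for t in truth["inputs"]]
        for p in prediction["inputs"]
    ]
    # compute() returns (row, column) index pairs, i.e. (prediction, truth)
    indexes = m.compute(scores)
    # Pair up the values stored under the matched keys
    matches = [
        (list(truth["inputs"].values())[t], list(prediction["inputs"].values())[p])
        for p, t in indexes
    ]
    # Score the first matched pair
    num_correct_items = count_correct_items(matches[0][0], matches[0][1])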
    recalls.append(num_correct_items / count_leaf_keys(truth["inputs"]))
    precisions.append(num_correct_items / count_leaf_keys(prediction["inputs"]))

print(mean(recalls), mean(precisions))

0.6935011404798668 0.6244815185295368
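
Matching keys is an assignment problem: every predicted key should be paired with at most one true key so that the pairs are as similar as possible overall. Note that `Munkres.compute` minimizes the total cost, so similarity scores are usually converted to costs (for example `100 - score`) before they are passed to it. The sketch below illustrates the idea with only the standard library. It swaps in `difflib.SequenceMatcher` for `thefuzz` and a brute-force search over permutations for the Hungarian algorithm, so it is only suitable for a handful of keys. The function names and the example keys are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import permutations


def similarity(a, b):
    """Similarity between two strings in [0, 1] (stand-in for a fuzzy ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_assignment(truth_keys, predicted_keys):
    """Return {predicted_key: truth_key} maximizing the total similarity."""
    # Permute the longer list so every key of the shorter one gets a partner
    if len(predicted_keys) <= len(truth_keys):
        candidates = (
            list(zip(predicted_keys, perm))
            for perm in permutations(truth_keys, len(predicted_keys))
        )
    else:
        candidates = (
            list(zip(perm, truth_keys))
            for perm in permutations(predicted_keys, len(truth_keys))
        )
    # Each candidate is a list of (predicted_key, truth_key) pairs
    best = max(candidates, key=lambda pairs: sum(similarity(p, t) for p, t in pairs))
    return dict(best)


truth_keys = ["sodium chloride", "water"]
predicted_keys = ["Water (solvent)", "NaCl / sodium chloride"]
print(best_assignment(truth_keys, predicted_keys))
```

Once the keys are paired, the values of every matched pair can be compared with `count_correct_items` and the counts summed over all pairs.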


### Data normalization
Sometimes the ground-truth data has been normalized to a certain format, for example with all quantities converted to SI units, while the original text reports them in different units. In such a situation, it is better to convert the model's prediction to the required units before checking those keys.

Let's take a **truth** and **prediction** consisting of values and units:

In [72]:
truth = {"voltage": {"value": 22.0, "unit": "V"}}

prediction = {"voltage": {"value": 22000.0, "unit": "mV"}}

Now, we can parse the prediction with **pint** and normalize the values to SI units.

In [73]:
ureg = UnitRegistry()
text_representation_of_value = (
    str(prediction["voltage"]["value"]) + " " + prediction["voltage"]["unit"]
)
print("Converting", text_representation_of_value)
normalized_pint_quantity = ureg(text_representation_of_value).to("V")
print("to", normalized_pint_quantity)

Converting 22000.0 mV
to 22.0 volt


Now we can check the magnitudes of our truth value and our normalized predicted value.

In [74]:
if truth["voltage"]["value"] == normalized_pint_quantity.magnitude:
    print("Predicted value is correct.")

Predicted value is correct.
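
Exact equality works for this example, but unit conversions involve floating-point arithmetic, and two values that should be equal can differ in the last digits. A tolerance-based comparison is often more robust. The sketch below uses `math.isclose` from the standard library; the helper name `values_match` and the tolerance of `1e-9` are assumptions that should be adapted to the precision of your data.

```python
import math


def values_match(truth_value, predicted_value, rel_tol=1e-9):
    """Compare two numbers with a relative tolerance instead of exact equality."""
    return math.isclose(truth_value, predicted_value, rel_tol=rel_tol)


# Classic floating-point example: the sum is not exactly 0.3
computed = 0.1 + 0.2
print(computed == 0.3)
print(values_match(0.3, computed))
```

The first comparison fails because of rounding, while the tolerance-based one accepts the value.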
