# Evaluations

Every extraction task needs a reliable way to check whether the extracted data is correct and to score how correct it is. The goal is to quantify the performance of the extraction pipeline (the model). Partial scores, usually between 0 and 1, show how close each data point is to the ground truth, so the pipeline can be improved by inspecting the low-scoring data points and fixing the edge cases or errors they reveal.

### Imports
We import the libraries used throughout this section here; a few specialized packages are imported later, in the cells that use them.

In [9]:
import json
from statistics import mean
from pint import UnitRegistry

### Simple example data

To start off with a simple example, we define a set of key-value pairs as our ground truth and a dummy model output as the prediction.

In [10]:
truth = {
    "text": "result",
    "correct": "correct",
    "number": 0.45,
    "wrong": 1.023,
    "bool": False,
    "missing": 0,
}
prediction = {
    "text": "result",
    "correct": "incorrect",
    "number": 0.45,
    "wrong": 1.025,
    "bull": False,
}

### Scoring a Key-Value pair

To properly evaluate a structured output from a model, we have to walk through the structure of the ground truth and the output simultaneously. We have to check every value we were expecting the model to return and count how many of these are correct.
With the count of the correct keys, we can calculate our metrics:

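$$
\text{precision} = \frac{|\{\text{relevant entries}\} \cap \{\text{retrieved entries}\}|}{|\{\text{retrieved entries}\}|}
$$

$$
\text{recall} = \frac{|\{\text{relevant entries}\} \cap \{\text{retrieved entries}\}|}{|\{\text{relevant entries}\}|}
$$

$$
\text{F1-Score} = \frac{2}{\frac{1}{\text{recall}} + \frac{1}{\text{precision}}}
$$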

In [12]:
relevant_entries_intersection_retrieved_entries = []
for t_key in truth:
    if t_key in prediction:
        if truth[t_key] == prediction[t_key]:
            relevant_entries_intersection_retrieved_entries.append(t_key)
recall = len(relevant_entries_intersection_retrieved_entries) / len(truth.keys())
precision = len(relevant_entries_intersection_retrieved_entries) / len(
    prediction.keys()
)
print(f"Recall: {recall:.2f}\nPrecision: {precision:.2f}")

Recall: 0.33
Precision: 0.40
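
The F1 formula given earlier can be applied to the same toy data. The following is a minimal sketch, not part of the original pipeline: the helper name `precision_recall_f1` and the `toy_` variable names are assumptions introduced for illustration, and the zero guards are one possible convention for empty or fully wrong outputs.

```python
def precision_recall_f1(truth, prediction):
    """Exact-match precision, recall and F1 for two flat dicts."""
    correct = [k for k in truth if k in prediction and truth[k] == prediction[k]]
    recall = len(correct) / len(truth) if truth else 0.0
    precision = len(correct) / len(prediction) if prediction else 0.0
    # F1 is the harmonic mean of precision and recall; it is defined as 0 here
    # when nothing is correct, which avoids a division by zero.
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Same toy data as in the example cell
toy_truth = {
    "text": "result",
    "correct": "correct",
    "number": 0.45,
    "wrong": 1.023,
    "bool": False,
    "missing": 0,
}
toy_prediction = {
    "text": "result",
    "correct": "incorrect",
    "number": 0.45,
    "wrong": 1.025,
    "bull": False,
}

precision, recall, f1 = precision_recall_f1(toy_truth, toy_prediction)
print(f"Recall: {recall:.2f}\nPrecision: {precision:.2f}\nF1: {f1:.2f}")
# Recall: 0.33
# Precision: 0.40
# F1: 0.36
```

Only `text` and `number` match exactly, so two of six expected keys and two of five returned keys are correct. The misspelled key `bull` and the near miss `1.025` both count as errors under exact matching.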


## Evaluation of results from the [Choosing a learning paradigm chapter](choosing-paradigm-openai-results)

### Loading the results
We will use the 1-shot responses from GPT-4o. The JSON file contains both the predicted structured output and the expected reference ground truth.

In [13]:
with open("../finetune/OpenAI_results.json", "r") as f:
    open_ai_results = json.load(f)
    truths = open_ai_results["1-shot"]["references"]
    predictions = open_ai_results["1-shot"]["predictions"]
    # Convert each JSON string to a dict. For predictions, keep only the text
    # between the first "{" and the last "}" to drop any surrounding non-JSON text.
    truths = [json.loads(truth) for truth in truths]
    predictions = [
        json.loads(prediction[prediction.index("{") : prediction.rindex("}") + 1])
        for prediction in predictions
    ]

### Defining common functions for later use

In [81]:
def common_items(dict1, dict2):
    """Yields (path, value) pairs for leaves that share a key path and have equal values in both dicts."""
    for key in dict1:
        if key in dict2:
            if isinstance(dict1[key], dict) and isinstance(dict2[key], dict):
                for subkey, subvalue in common_items(dict1[key], dict2[key]):
                    yield f"{key}.{subkey}", subvalue
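            # Lists: compare every dict item of one list against every dict item
            # of the other, so item order does not matter
            elif isinstance(dict1[key], list) and isinstance(dict2[key], list):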
                for item1 in dict1[key]:
                    for item2 in dict2[key]:
                        if isinstance(item1, dict) and isinstance(item2, dict):
                            for subkey, subvalue in common_items(item1, item2):
                                yield f"{key}.[].{subkey}", subvalue
            elif dict1[key] == dict2[key]:
                yield key, dict1[key]


def count_correct_items(dict1, dict2):
    """Counts the leaf entries that match between two dicts."""
    return sum(1 for _ in common_items(dict1, dict2))


def count_leaf_keys(d):
    """Counts the number of keys at the deepest levels (leaf) of an arbitrarily nested dict"""
    if isinstance(d, dict):
        count = 0
        for value in d.values():
            count += count_leaf_keys(value)
        return count
    elif isinstance(d, list):
        count = 0
        for item in d:
            count += count_leaf_keys(item)
        return count
    else:
        return 1
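
When a document scores low, it helps to see which leaves differ rather than only how many. The sketch below is an illustrative addition, not part of the original notebook: `flatten` and the example dicts are hypothetical, and unlike `common_items` it pairs list items by position.

```python
def flatten(obj, prefix=""):
    """Yield (path, value) pairs for every leaf of a nested dict/list."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(obj, list):
        for index, item in enumerate(obj):
            yield from flatten(item, f"{prefix}[{index}]")
    else:
        yield prefix, obj


# Hypothetical nested records, for illustration only
example_truth = {
    "inputs": {"solvent": {"name": "water", "amount": 5.0}},
    "temperature": [25, 80],
}
example_prediction = {
    "inputs": {"solvent": {"name": "water", "amount": 50.0}},
    "temperature": [25],
}

truth_leaves = dict(flatten(example_truth))
prediction_leaves = dict(flatten(example_prediction))

# Map each mismatching path to (expected, predicted); a missing path shows up
# as None, so a genuine None value cannot be told apart from a missing one.
mismatches = {
    path: (value, prediction_leaves.get(path))
    for path, value in truth_leaves.items()
    if prediction_leaves.get(path) != value
}
print(mismatches)
# {'inputs.solvent.amount': (5.0, 50.0), 'temperature[1]': (80, None)}
```

The number of flattened leaves (four for `example_truth`) is the same quantity that a leaf counter such as `count_leaf_keys` returns for this input.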

### Evaluation of the results

In [82]:
recalls = []
precisions = []
for i, truth in enumerate(truths):
    prediction = predictions[i]
    num_correct_items = count_correct_items(truth, prediction)
    recalls.append(num_correct_items / count_leaf_keys(truth))
    precisions.append(num_correct_items / count_leaf_keys(prediction))

recall = mean(recalls)
precision = mean(precisions)
print(f"Recall: {recall:.2f}\nPrecision: {precision:.2f}")

Recall: 0.30
Precision: 0.30


```{note}
The variable `truth` is the manually curated reference extraction and `prediction` is what the LLM returns through the pipeline.
```

#### F1 score

In [83]:
f1_score = 2 / ((1 / recall) + (1 / precision))
print(f"F1 score: {f1_score:.2f}")

F1 score: 0.30
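
The F1 score above is computed from the mean recall and the mean precision. An alternative is to compute F1 per document and average afterwards; the two generally differ. The sketch below uses hypothetical per-document scores, not results from the dataset, and the helper `f1` is an assumption that returns 0 when both inputs are 0 (the reciprocal form would raise a `ZeroDivisionError` there).

```python
from statistics import mean


def f1(precision, recall):
    """Harmonic mean of precision and recall, defined as 0 if both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-document scores, for illustration only
doc_precisions = [1.0, 0.5, 0.0]
doc_recalls = [0.5, 0.5, 0.0]

# Variant 1: F1 of the averaged precision and recall
f1_of_means = f1(mean(doc_precisions), mean(doc_recalls))

# Variant 2: average of the per-document F1 scores
mean_of_f1s = mean(f1(p, r) for p, r in zip(doc_precisions, doc_recalls))

print(f"F1 of means: {f1_of_means:.2f}")
print(f"Mean of F1s: {mean_of_f1s:.2f}")
# F1 of means: 0.40
# Mean of F1s: 0.39
```

Whichever variant is reported, it is worth stating which one it is, since documents with a score of zero pull the two apart.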


### Matching to ground truth

First, let's re-run the evaluation on just the **inputs** section of our schema. We will use this as an example to show how fuzzy matching can improve our scores. Sometimes the model misinterprets part of the structure but still gets values nested deep inside that structure right. Matching lets us pair such a misinterpreted section with its counterpart in the ground truth, so that its correct subsections still count.

We run the same evaluation routine, but instead of looking for correct keys in the whole output we limit it to the **inputs** section.

Comparing the two code blocks below shows the improvement in the evaluation scores: the first uses exact key matching, the second fuzzy key matching. Matching can give you a better understanding of where and how your model's predictions need improvement.

In [84]:
recalls = []
precisions = []
for i, truth in enumerate(truths):
    prediction = predictions[i]
    num_correct_items = count_correct_items(
        truth["inputs"], prediction["inputs"]
    )  # Here we select just the "inputs"
    recalls.append(num_correct_items / count_leaf_keys(truth["inputs"]))
    precisions.append(num_correct_items / count_leaf_keys(prediction["inputs"]))

recall = mean(recalls)
precision = mean(precisions)
print(f"Recall: {recall:.2f}\nPrecision: {precision:.2f}")

Recall: 0.13
Precision: 0.11


In [138]:
from munkres import Munkres
from thefuzz import fuzz

m = Munkres()  # We will use the Kuhn-Munkres (Hungarian) algorithm for our matching

recalls = []
precisions = []
for i, truth in enumerate(truths):
    prediction = predictions[i]
    # 2D matrix of string-similarity scores (0-100): one row per predicted key,
    # one column per ground-truth key
    scores = [
        [fuzz.token_sort_ratio(t, p) for t in truth["inputs"]]
        for p in prediction["inputs"]
    ]
    # Assign predicted keys to ground-truth keys. Note that Munkres.compute
    # minimizes the total of the matrix entries it selects.
    indexes = m.compute(scores)
    # Collect the matched (truth value, prediction value) pairs
    matches = [
        (list(truth["inputs"].values())[t], list(prediction["inputs"].values())[p])
        for p, t in indexes
    ]
    # Score the first matched pair
    num_correct_items = count_correct_items(matches[0][0], matches[0][1])
    recalls.append(num_correct_items / count_leaf_keys(truth["inputs"]))
    precisions.append(num_correct_items / count_leaf_keys(prediction["inputs"]))


recall = mean(recalls)
precision = mean(precisions)
print(f"Recall: {recall:.2f}\nPrecision: {precision:.2f}")

Recall: 0.69
Precision: 0.62
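
Assignment solvers such as Kuhn-Munkres minimize a total cost, so a similarity score is usually turned into a cost first (for example, one minus the similarity). The sketch below illustrates that idea using only the standard library: `difflib.SequenceMatcher` stands in for the fuzzy string score and a brute-force search over permutations stands in for the Kuhn-Munkres algorithm. The function names and the example keys are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import permutations


def similarity(a, b):
    """String similarity in [0, 1] (stand-in for a fuzzy-matching score)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_assignment(truth_keys, predicted_keys):
    """Return the (truth_key, predicted_key) pairs with the lowest total cost.

    Brute force over permutations, so only suitable for a handful of keys.
    Assumes len(predicted_keys) >= len(truth_keys).
    """
    best_pairs, best_cost = [], float("inf")
    for candidate in permutations(predicted_keys, len(truth_keys)):
        # Cost = 1 - similarity, so minimizing cost maximizes similarity
        cost = sum(1 - similarity(t, p) for t, p in zip(truth_keys, candidate))
        if cost < best_cost:
            best_pairs, best_cost = list(zip(truth_keys, candidate)), cost
    return best_pairs


# Hypothetical keys, for illustration only
truth_keys = ["zinc nitrate", "terephthalic acid"]
predicted_keys = ["terephtalic acid", "Zn nitrate", "water"]

pairs = best_assignment(truth_keys, predicted_keys)
print(pairs)
# [('zinc nitrate', 'Zn nitrate'), ('terephthalic acid', 'terephtalic acid')]
```

Once the pairs are known, each matched pair of values can be scored and the counts summed. Unmatched predicted keys, such as `water` here, still count against precision.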


### Data normalization
The ground truth is often normalized to a fixed format, for example with all quantities converted to SI units, while the original text may report them in other units. In that case, convert the model's prediction to the required units before comparing those keys.

Let's take a **truth** and **prediction** consisting of values and units:

In [19]:
truth = {"mass": {"value": 22.0, "unit": "g"}}

prediction = {"mass": {"value": 22000.0, "unit": "mg"}}

Now, we can parse the prediction with **pint** and convert it to the unit used in the ground truth (grams).

In [20]:
ureg = UnitRegistry()
text_representation_of_value = (
    str(prediction["mass"]["value"]) + " " + prediction["mass"]["unit"]
)
print("Converting", text_representation_of_value)
normalized_pint_quantity = ureg(text_representation_of_value).to("g")
print("to", normalized_pint_quantity)

Converting 22000.0 mg
to 22.0 gram


Now we can check the magnitudes of our truth value and our normalized predicted value.

In [21]:
from math import isclose  # tolerance-based comparison guards against float rounding
if isclose(truth["mass"]["value"], normalized_pint_quantity.magnitude):
    print("Predicted value is correct.")

Predicted value is correct.
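
The same normalize-then-compare step can be wrapped in a small helper with an explicit tolerance. The sketch below is a simplified illustration that avoids external dependencies: the hand-written `TO_GRAMS` table is a hypothetical stand-in for a unit library such as pint, and `masses_match` and its `rel_tol` default are assumptions.

```python
import math

# Hypothetical conversion factors to grams; a real pipeline would rely on a
# unit library instead of a hand-written table.
TO_GRAMS = {"kg": 1000.0, "g": 1.0, "mg": 0.001}


def masses_match(truth_entry, predicted_entry, rel_tol=1e-6):
    """Compare two {"value", "unit"} mass entries after converting both to grams."""
    truth_g = truth_entry["value"] * TO_GRAMS[truth_entry["unit"]]
    predicted_g = predicted_entry["value"] * TO_GRAMS[predicted_entry["unit"]]
    # A relative tolerance absorbs floating-point rounding from the conversion
    return math.isclose(truth_g, predicted_g, rel_tol=rel_tol)


print(masses_match({"value": 22.0, "unit": "g"}, {"value": 22000.0, "unit": "mg"}))
# True
print(masses_match({"value": 22.0, "unit": "g"}, {"value": 0.0221, "unit": "kg"}))
# False
```

A unit that is not in the table raises a `KeyError`, which makes unparseable predictions visible instead of silently scoring them.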


#### Chemically informed normalization
Chemical compounds can be reported in various forms, such as different names or formulas. To make sure we got the expected compound regardless of how it is written, we can convert every value to its SMILES representation before comparing. Otherwise, a correct extraction written in a different form would be scored as wrong.

Here, we set up example extractions from a model and show how to get their SMILES before validation.

In [112]:
from llmstructdata.utils import name_to_smiles

truth = {"solvents": ["CCCO", "CC(C)O", "CC(C)=O", "CC(=O)O", "C=O"]}

prediction = {"solvents": ["propanol", "isopropanol", "Propanone", "Ethanoic acid"]}

predictions_as_smiles = {
    "solvents": [name_to_smiles(solvent) for solvent in prediction["solvents"]]
}

number_of_values_correct = len(
    [
        solvent
        for solvent in truth["solvents"]
        if solvent in predictions_as_smiles["solvents"]
    ]
)

precision = number_of_values_correct / len(predictions_as_smiles["solvents"])
recall = number_of_values_correct / len(truth["solvents"])
print(f"Recall: {recall:.2f}\nPrecision: {precision:.2f}")

Recall: 0.80
Precision: 1.00
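
Lists of normalized identifiers can also be scored as sets, which ignores order and duplicates. The sketch below is an illustrative variant, not the notebook's method: `score_lists` is a hypothetical helper, the predicted SMILES are assumed values, and the use of `None` for a name that could not be resolved is an assumption about the normalization step.

```python
def score_lists(truth_items, predicted_items):
    """Precision and recall for unordered lists of normalized identifiers."""
    truth_set = set(truth_items)
    # Drop entries that could not be normalized (represented here as None)
    predicted_set = {item for item in predicted_items if item is not None}
    correct = truth_set & predicted_set
    precision = len(correct) / len(predicted_set) if predicted_set else 0.0
    recall = len(correct) / len(truth_set) if truth_set else 0.0
    return precision, recall


truth_smiles = ["CCCO", "CC(C)O", "CC(C)=O", "CC(=O)O", "C=O"]
# Assumed output of a name-to-SMILES step: one duplicate and one failed lookup
predicted_smiles = ["CCCO", "CC(C)O", "CC(C)O", None, "CC(=O)O"]

list_precision, list_recall = score_lists(truth_smiles, predicted_smiles)
print(f"Recall: {list_recall:.2f}\nPrecision: {list_precision:.2f}")
# Recall: 0.60
# Precision: 1.00
```

Dropping failed lookups means they do not lower precision, although the missing compound still lowers recall. Counting them as wrong predictions instead is an equally defensible choice.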


##### Inorganics
Here is how one could do a similar normalization for inorganic substances, reconciling formulas written in the Hill system, in IUPAC order, or as a reduced empirical formula.

In [136]:
from pymatgen.core import Composition

truth = {"inorganics": ["SiC2", "CaCO3", "NaCN", "CO", "HCl"]}

prediction = {"inorganics": ["C2 Si", "C Ca O3", "Na1 C1 N1", "C1 O6"]}

predictions_as_reduced_formula = {
    "inorganics": [
        Composition(inorganic).reduced_formula for inorganic in prediction["inorganics"]
    ]
}

number_of_values_correct = len(
    [
        inorganic
        for inorganic in truth["inorganics"]
        if inorganic in predictions_as_reduced_formula["inorganics"]
    ]
)

precision = number_of_values_correct / len(
    predictions_as_reduced_formula["inorganics"]
)
recall = number_of_values_correct / len(truth["inorganics"])
print(f"Recall: {recall:.2f}\nPrecision: {precision:.2f}")

Recall: 0.60
Precision: 0.75
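
To see why element order and explicit counts of 1 do not matter after normalization, the idea can be sketched without any chemistry library. The parser below is a deliberately simplified stand-in, not how pymatgen works internally: `reduced_composition` is a hypothetical helper that handles only flat formulas with integer counts (no parentheses, charges, or hydrates) and does not check that the symbols are real elements.

```python
import re
from functools import reduce
from math import gcd


def reduced_composition(formula):
    """Parse a simple formula into element counts divided by their common factor."""
    counts = {}
    # An element symbol is one capital letter plus an optional lowercase letter,
    # followed by an optional integer count (a missing count means 1)
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(number or 1)
    if not counts:
        raise ValueError(f"No elements found in {formula!r}")
    divisor = reduce(gcd, counts.values())
    return {element: count // divisor for element, count in counts.items()}


print(reduced_composition("C2 Si") == reduced_composition("SiC2"))
# True
print(reduced_composition("Na1 C1 N1") == reduced_composition("NaCN"))
# True
print(reduced_composition("C1 O6") == reduced_composition("CO"))
# False
```

Comparing dictionaries of element counts sidesteps ordering conventions entirely, which is why the Hill-ordered `C2 Si` matches `SiC2`.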
