# Preprocessing & Controlled Output

Before performing large-scale extraction, it's important to **understand the structure and limitations** of the data and model outputs.

This step is part of the **preprocessing workflow** and helps you identify:

- What kind of information the model can extract reliably,
- Which fields are ambiguous, frequently missing, or often incorrect,
- How to reduce common errors in automatic extraction.


## 🧩 Constrained Output Formats

One powerful strategy is to constrain the output format of the model – for example, forcing it to respond in a fixed **JSON structure**.  
This makes results easier to validate, parse, and store in downstream pipelines.

To define and validate such structured outputs in Python, we use **[Pydantic](https://docs.pydantic.dev/latest/)** – a robust library for data modeling and validation.

Below, we define an example schema for extracting relevant data from polymer-related texts:

- `name`: The name of the polymer  
- `synthesis_method`: How the polymer was synthesized  
- `temperature`: The temperature(s) at which the polymer was synthesized (a list of numbers)  
- `homopolymer`: Boolean indicating whether the polymer is a homopolymer or not  

We will prompt the model to return its answer strictly in this format and validate it using the Pydantic schema.
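
To make "validation" concrete, the sketch below parses a hypothetical model response and checks each field by hand. It is only a plain-Python stand-in for the kind of checks Pydantic performs automatically (and more thoroughly) with the schema in the next cell. The response text, its values, and the helper `check_polymer` are invented for illustration.

```python
import json

# Hypothetical model response; the values are invented for illustration only.
raw_response = """
{
  "polymers": [
    {
      "name": "polystyrene",
      "synthesis_method": "free-radical polymerization",
      "temperature": [70.0],
      "homopolymer": true
    }
  ]
}
"""

def check_polymer(entry: dict) -> list:
    """Return a list of problems found in one polymer entry (empty if none)."""
    problems = []
    # `name` is the only required field
    if not isinstance(entry.get("name"), str):
        problems.append("name must be a string")
    method = entry.get("synthesis_method")
    if method is not None and not isinstance(method, str):
        problems.append("synthesis_method must be a string or null")
    temps = entry.get("temperature")
    if temps is not None and not (
        isinstance(temps, list)
        and all(isinstance(t, (int, float)) and not isinstance(t, bool) for t in temps)
    ):
        problems.append("temperature must be a list of numbers or null")
    homo = entry.get("homopolymer")
    if homo is not None and not isinstance(homo, bool):
        problems.append("homopolymer must be a boolean or null")
    return problems

data = json.loads(raw_response)
for entry in data["polymers"]:
    print(entry["name"], "->", check_polymer(entry) or "valid")
```

Writing such checks by hand quickly becomes tedious and error-prone, which is the main reason to declare the schema once and let a validation library enforce it.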


In [None]:
from pydantic import BaseModel
from typing import Optional, List

class PolymerExtraction(BaseModel):
    name: str
    synthesis_method: Optional[str] = None
    temperature: Optional[List[float]] = None
    homopolymer: Optional[bool] = None
    
class PolymerList(BaseModel):
    polymers: List[PolymerExtraction]

In [None]:
import os
os.environ["GROQ_API_KEY"] = "___" # TODO: add your groq api key here

In [None]:
import instructor
from pprint import pprint
from litellm import completion


# Patch the LiteLLM client for structured output
client = instructor.from_litellm(model="groq/llama3-70b-8192", completion=completion)

# Load input text (e.g. from a scientific article)
with open("example_paper.txt", "r") as f:
    polymer_text = f.read()

# Define user prompt with short instruction + input text
user_prompt = f"""Extract polymer data (name, synthesis method, temperature, homopolymer) from the following text:
{polymer_text}"""

extracted_data = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a polymer chemistry expert extracting structured data from scientific publications."},
        {"role": "user", "content": user_prompt}
    ],
    response_model=PolymerList,
    max_retries=2
)


# Output will be returned as a validated PolymerList instance
pprint(extracted_data.model_dump())

## Evaluations

Every extraction task needs a way to evaluate whether the extracted data is correct and to score how correct it is. The goal is to quantify the performance of the extraction pipeline (and of the model behind it). Partial scores, usually between 0 and 1, show how close each data point is to its annotation. Inspecting the lowest-scoring data points reveals edge cases and errors, which can then be fixed to improve the pipeline.
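
As a minimal sketch of such a partial score, one simple option is the fraction of annotated fields that the prediction reproduces exactly. The helper `field_score` and both records below are hypothetical and only illustrate the idea; real pipelines usually weight fields or compare them more leniently.

```python
def field_score(truth: dict, predicted: dict) -> float:
    """Fraction of ground-truth fields that the prediction reproduces exactly."""
    if not truth:
        return 0.0
    correct = sum(1 for key, value in truth.items() if predicted.get(key) == value)
    return correct / len(truth)

# Hypothetical annotation and prediction, invented for illustration.
truth = {
    "name": "polystyrene",
    "synthesis_method": "emulsion polymerization",
    "homopolymer": True,
}
predicted = {
    "name": "polystyrene",
    "synthesis_method": "bulk polymerization",
    "homopolymer": True,
}

score = field_score(truth, predicted)
print(f"Partial score: {score:.2f}")
```

Here two of the three fields agree, so the record scores roughly 0.67, and the mismatching `synthesis_method` is the field worth inspecting.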

To assess how well the LLM is performing, we need to compare its output against ground truth annotations.

A common starting point is to compute basic metrics like **Precision** and **Recall**:

- **Precision**: How many of the extracted items are correct?
- **Recall**: How many of the correct items were actually extracted?

As a first step, we evaluate based on **exact string matches** without any deeper matching logic.
We then show a minimal example of how to add matching logic (e.g. fuzzy name similarity).


In [None]:
# Ground truth (annotated polymers)
ground_truth = ["polyethylene", "polystyrene", "poly(lactic acid)"]

# Model output (extracted polymers)
extracted = ["polyethylene", "polymer", "polymer X", "poly(lactic acid)"]

# Compute simple metrics
true_positives = len(set(ground_truth) & set(extracted))

precision = true_positives / (len(extracted) + 1e-8)
recall = true_positives / (len(ground_truth) + 1e-8)
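# Harmonic mean of precision and recall; the epsilon avoids division by zero
# when there are no true positives
f1_score = 2 * precision * recall / (precision + recall + 1e-8)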

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-Score: {f1_score:.2f}")

Feel free to modify the `ground_truth` or `extracted` lists above to explore how different prediction results affect the evaluation metrics.
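
The printed values can be checked by hand, assuming the unmodified lists from the cell above. Two extracted names (`polyethylene` and `poly(lactic acid)`) appear in the ground truth, so $TP = 2$, with four extracted and three annotated items:

$$
\text{Precision} = \frac{TP}{|\text{extracted}|} = \frac{2}{4} = 0.50,
\qquad
\text{Recall} = \frac{TP}{|\text{ground truth}|} = \frac{2}{3} \approx 0.67
$$

$$
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
    = \frac{2 \cdot 0.50 \cdot 0.67}{0.50 + 0.67} \approx 0.57
$$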

### Matching and Fuzzy Comparison

In real-world evaluations, we typically compare the model’s predicted outputs to a set of manually annotated ground truth entries.

However, this is not just a matter of counting. The model may:  
- return **more items than expected** (hallucinations),
- **miss some relevant entries** (false negatives),
- or express correct answers in **slightly different formats** (e.g. synonyms, abbreviations, typos).

To handle this, we need a way to **map predicted entries to ground truth items** before we compute metrics like precision and recall.  
This step is called **matching**.

A common first approach is **fuzzy string matching**, where we match two entries if they are similar above a certain threshold (instead of requiring exact equality).


In [None]:
from difflib import SequenceMatcher

# Define a fuzzy matching function using character similarity
def fuzzy_match(a, b, threshold):  # TODO: Try out different similarity thresholds
    """
    Returns True if strings `a` and `b` are similar above a certain threshold.
    Uses SequenceMatcher for character-based similarity.
    """
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= threshold 

# Ground truth annotations (gold standard)
ground_truth = ["polyethylene", "polystyrene", "poly(lactic acid)"]

# Model output (predictions)
predicted = ["polyethylene", "polymer", "poly(lactic acid)", "poly(lactic-acid)", "polylactic acid"]

threshold = 0.8 # <-- try different thresholds!

print(f"Matched pairs (fuzzy threshold ≥ {threshold}):\n")

# Try to find matching pairs based on fuzzy similarity
for gt in ground_truth:
    for pred in predicted:
        if fuzzy_match(gt, pred, threshold):
            # TODO: Print the match pair
            print(f"✓ '{___}' ↔ '{___}'")


In more complex schemas involving multiple fields, lists, or nested structures, exact string matching is usually not sufficient.

Small variations in how information is expressed (e.g. formatting or synonyms) can lead to incorrect evaluation results.

To obtain **meaningful metrics**, it is essential to match the extracted entries to the ground truth before calculating metrics. 
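
One way to combine matching and metrics is a greedy one-to-one assignment: rank all candidate pairs by similarity and accept the best ones, so that each ground truth entry and each prediction is used at most once. The sketch below is one possible approach under these assumptions, not the only valid matching strategy. The helpers `similarity` and `greedy_match` are hypothetical names.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-based similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def greedy_match(ground_truth, predicted, threshold=0.8):
    """Pair each ground-truth item with at most one prediction, best pairs first."""
    candidates = sorted(
        ((similarity(gt, pred), gt, pred) for gt in ground_truth for pred in predicted),
        reverse=True,
    )
    matches, used_gt, used_pred = [], set(), set()
    for score, gt, pred in candidates:
        if score < threshold:
            break  # all remaining candidates are even less similar
        if gt in used_gt or pred in used_pred:
            continue  # one side of this pair is already matched
        matches.append((gt, pred, score))
        used_gt.add(gt)
        used_pred.add(pred)
    return matches

ground_truth = ["polyethylene", "polystyrene", "poly(lactic acid)"]
predicted = ["polyethylene", "polymer", "poly(lactic acid)", "poly(lactic-acid)", "polylactic acid"]

matches = greedy_match(ground_truth, predicted)

# Every matched pair counts as one true positive
precision = len(matches) / len(predicted) if predicted else 0.0
recall = len(matches) / len(ground_truth) if ground_truth else 0.0

for gt, pred, score in matches:
    print(f"{gt!r} <-> {pred!r} ({score:.2f})")
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```

With a one-to-one assignment, near-duplicate predictions such as `poly(lactic-acid)` stay unmatched and count as false positives. Whether duplicates should be penalized or merged beforehand is a design choice that depends on the task.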

### Normalize Before Matching

Before comparing values between model predictions and ground truth annotations, it is crucial to **normalize all units and formats**.  

Otherwise, two semantically equivalent values like `22 g` and `22000 mg` would be seen as a mismatch.

This applies to:
- **Physical quantities** like temperature, mass, or pressure
- **Chemical names or formulas**, which can be normalized to canonical representations (e.g. **SMILES** strings)

Below is an example of how to normalize numerical units using the [`pint`](https://pint.readthedocs.io/en/stable/) library.


In [None]:
from pint import UnitRegistry

# Ground truth annotation
truth = {"mass": {"value": 22.0, "unit": "g"}}

# Model prediction in different unit (but same value!)
prediction = {"mass": {"value": 22000.0, "unit": "mg"}}

# Initialize unit registry for physical quantities
ureg = UnitRegistry()

# Create a string representation: e.g. "22000.0 mg"
text_representation_of_value = (
    str(prediction["mass"]["value"]) + " " + prediction["mass"]["unit"]
)

print("Converting:", text_representation_of_value)

# Convert the predicted value to grams (g)
normalized_pint_quantity = ureg(___).to("g") # TODO: add the value to be converted

print("→", normalized_pint_quantity)
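
Once both values share a unit, they should be compared with a tolerance rather than with `==`, because floating-point conversions rarely reproduce a number exactly. The sketch below illustrates this with a small hand-written conversion table. The table, the helpers `to_grams` and `same_mass`, and the chosen tolerance are assumptions for illustration; in practice a unit library such as `pint` would do the conversion.

```python
import math

# Hand-written conversion factors to grams; an assumption standing in for a unit library.
TO_GRAMS = {"mg": 1e-3, "g": 1.0, "kg": 1e3}

def to_grams(entry: dict) -> float:
    """Convert a {"value": ..., "unit": ...} entry to grams."""
    return entry["value"] * TO_GRAMS[entry["unit"]]

def same_mass(truth: dict, prediction: dict, rel_tol: float = 1e-6) -> bool:
    """True if both entries describe the same mass within a relative tolerance."""
    return math.isclose(to_grams(truth), to_grams(prediction), rel_tol=rel_tol)

truth_mass = {"value": 22.0, "unit": "g"}
predicted_mass = {"value": 22000.0, "unit": "mg"}

print(same_mass(truth_mass, predicted_mass))
```

The relative tolerance should reflect the precision reported in the source text: a value rounded to two significant figures in a paper does not justify a tolerance of `1e-6`.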

## Summary

In this notebook, we focused on the **postprocessing and evaluation** of structured outputs from LLM-based extraction systems:

- We used **constrained output formats** to extract data in a predefined JSON structure.
- We defined a **Pydantic schema** to validate and structure the model’s output.
- We normalized units (e.g. mg → g, K → °C) to allow fair comparisons.
- We introduced **basic evaluation metrics** like precision, recall, and F1-score.
- We explored **fuzzy matching** to account for variations in wording and formatting, and **normalization** to account for variations in units.

These steps form the basis for building **trustworthy and reproducible extraction pipelines** for scientific applications.
