## Evaluating Structured Outputs in LLM Pipelines

### Motivation
Many real-world LLM applications rely on structured outputs, such as JSON, schemas, or key–value records, that are consumed directly by downstream systems. In these settings, failures are often silent: an output may appear fluent and plausible while violating structural constraints in ways that break execution or corrupt data. This notebook treats structured-output evaluation as a way to study how reliably models adhere to explicit specifications under generative uncertainty.

Rather than focusing on natural language correctness, the emphasis here is on whether models can consistently map unstructured inputs to well-formed, machine-readable representations.

### Experimental Setup
We evaluate model outputs that are intended to follow a predefined schema. Instead of binary validity checks alone, we use distance-based metrics (e.g. edit distance over JSON structures) to quantify how an output deviates from the target structure when it fails.

This allows us to distinguish between superficial formatting errors and deeper semantic or structural mismatches, which often have very different implications in production systems.
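
As a rough illustration of that distinction, the sketch below sorts raw model outputs into a formatting-level failure (the text is not parseable JSON) and a structural failure (the text parses but lacks required keys). The helper `classify_output`, its labels, and the key names are hypothetical and chosen only for this example; they are not part of LangChain or LangSmith.

```python
import json


def classify_output(raw_text, required_keys):
    """Label an output as 'unparseable', 'not_an_object', 'missing_keys', or 'ok'."""
    try:
        parsed = json.loads(raw_text)
    except json.JSONDecodeError:
        # Formatting-level failure: the text is not valid JSON at all.
        return "unparseable"
    if not isinstance(parsed, dict):
        # Valid JSON, but not the object (mapping) we expect at the top level.
        return "not_an_object"
    if any(key not in parsed for key in required_keys):
        # Structural failure: parses fine, but required fields are absent.
        return "missing_keys"
    return "ok"


# Illustrative key names only (assumed for this sketch).
REQUIRED_KEYS = ["document_title", "effective_date"]

trailing_comma = '{"document_title": "Lease", "effective_date": "2020-01-01",}'
missing_field = '{"document_title": "Lease"}'
well_formed = '{"document_title": "Lease", "effective_date": "2020-01-01"}'

for sample in (trailing_comma, missing_field, well_formed):
    print(classify_output(sample, REQUIRED_KEYS))
```

A binary validity check would treat the first two samples identically, even though the first is usually repairable in post-processing and the second reflects missing content.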

### Prerequisites and Setup

This notebook depends on the following packages:

- `langchain`, `langsmith`, `langchain_experimental`: The core libraries for building the chain, connecting to LangSmith for evaluation, and accessing experimental features like Anthropic Functions.
- `anthropic`: The Python SDK for the Anthropic (Claude) API.
- `jsonschema`: A dependency used by LangChain's extraction tools to validate the structure of the output.

In [None]:
# The '%pip install' command installs python packages from the notebook.
# -U ensures we get the latest versions.
# --quiet suppresses the installation output for a cleaner interface.
%pip install -U --quiet langchain langsmith langchain_experimental anthropic jsonschema

Next, set the following environment variables:

- **`LANGCHAIN_API_KEY`**: Your secret key for authenticating with LangSmith, which enables logging and evaluation.
- **`ANTHROPIC_API_KEY`**: Your secret key for the Anthropic API, required to use Claude models.
- **`LANGCHAIN_ENDPOINT`**: This URL directs all LangChain tracing data to the LangSmith platform.

In [None]:
import os # Import the 'os' module to interact with the operating system's environment variables.
import uuid # Import the 'uuid' module to generate a unique identifier.

uid = uuid.uuid4() # Create a unique identifier (declared here but not used elsewhere in this notebook).
os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY" # Set your LangSmith API key as an environment variable.
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-***" # Set your Anthropic (Claude) API key as an environment variable.
# Update with your API URL if using a hosted instance of Langsmith.
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com" # Set the LangSmith API endpoint.

### Create the Evaluation Dataset

For this task, we need examples that pair unstructured text (a contract) with its corresponding structured representation (the filled-out contract details).

We will use a public LangSmith dataset derived from the Contract Understanding Atticus Dataset (CUAD). Because LangSmith datasets can be shared and cloned, collaborators can evaluate against the same examples, which helps reproducibility. The `clone_public_dataset` function copies this public dataset into your own LangSmith account so that you can run evaluations against it.

You can explore the original public dataset here: [Contract Extraction Dataset](https://smith.langchain.com/public/08ab7912-006e-4c00-a973-0f833e74907b/d).

In [1]:
from langsmith import Client # Import the Client class to interact with the LangSmith API.

# The URL of the public dataset we want to use.
dataset_url = (
    "https://smith.langchain.com/public/08ab7912-006e-4c00-a973-0f833e74907b/d"
)
# Define a name for our local copy of the dataset.
dataset_name = "Contract Extraction"

# Instantiate the LangSmith client.
client = Client()
# Use the client to clone the public dataset into your LangSmith account.
client.clone_public_dataset(dataset_url)

### Define the Extraction Chain

The first step in defining our chain is to create the target schema using Pydantic. This tells the LLM exactly what information to look for and how to structure it.

In [2]:
from typing import List, Optional, Union # Import typing hints for defining our data models.

from pydantic import BaseModel # Import BaseModel from Pydantic to create our data schemas.


# Define the schema for a physical address.
class Address(BaseModel):
    street: str # The street address.
    city: str # The city.
    state: str # The state or province.
    zip_code: str # The postal or zip code.
    country: Optional[str] # The country, marked as optional.


# Define the schema for a party involved in the contract.
class Party(BaseModel):
    name: str # The name of the party.
    address: Address # A nested Address object.
    type: Optional[str] # The type of party (e.g., "Landlord", "Tenant"), marked as optional.


# Define the schema for a single section of the contract.
class Section(BaseModel):
    title: str # The title of the section.
    content: str # The full text content of the section.


# Define the top-level schema for the entire contract.
class Contract(BaseModel):
    document_title: str # The main title of the contract document.
    exhibit_number: Optional[str] # Any exhibit number associated with the contract, optional.
    effective_date: str # The date the contract becomes effective.
    parties: List[Party] # A list of Party objects.
    sections: List[Section] # A list of Section objects.

With our Pydantic schema defined, we can now construct the full extraction chain. This chain will take the raw text of a contract as input and produce the structured `Contract` object as output. We will use LangChain Expression Language (LCEL) to pipe the components together.

In [20]:
from langchain import hub # Import the hub to pull pre-made prompts.
from langchain.chains import create_extraction_chain # Import a helper function to easily create an extraction chain.
from langchain_experimental.llms.anthropic_functions import AnthropicFunctions # Import the experimental Anthropic Functions wrapper.

# Pull a prompt that is specifically designed for contract extraction with Anthropic models.
contract_prompt = hub.pull("wfh/anthropic_contract_extraction")


# Create the core extraction logic as a sub-chain.
extraction_subchain = create_extraction_chain(
    Contract.schema(), # Provide the Pydantic schema that we want to extract.
    llm=AnthropicFunctions(model="claude-2.1", max_tokens=20_000), # Use the Anthropic Functions LLM, specifying a model and max tokens.
    prompt=contract_prompt, # Provide the specialized prompt from the hub.
)
# The dataset provides input with a key named "context", but our extraction_subchain expects a key named "input".
# We use LCEL to create a final chain that correctly maps the keys.
chain = (
    # This lambda function takes the original input 'x' and transforms it into the format the subchain expects.
    (lambda x: {"input": x["context"]})
    | extraction_subchain # The transformed input is piped into our extraction subchain.
    # The subchain's output is `{'text': [...]}`. We transform it again to `{'output': [...]}` for the evaluator.
    | (lambda x: {"output": x["text"]})
)
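
The key remapping around the subchain can be seen in isolation with plain Python functions. In the sketch below, `fake_extraction_subchain` is a hypothetical stand-in that only mimics the input and output keys described in the comments above (`"input"` in, `"text"` out); it does not call a model and is not the real extraction chain.

```python
def fake_extraction_subchain(inputs):
    """Hypothetical stub: reads the "input" key and returns {"text": [...]}."""
    # Pretend the first line of the document is its title.
    first_line = inputs["input"].splitlines()[0]
    return {"text": [{"document_title": first_line}]}


def run_pipeline(example):
    """Mirror the three stages of the chain: rename, extract, rename."""
    renamed = {"input": example["context"]}      # dataset key -> subchain key
    result = fake_extraction_subchain(renamed)   # subchain returns {"text": [...]}
    return {"output": result["text"]}            # subchain key -> evaluator key


example = {"context": "LEASE AGREEMENT\nThis agreement is made between ..."}
print(run_pipeline(example))
```

Keeping the adapters as thin wrappers means the subchain itself stays unaware of the dataset's key names, so the same subchain can be reused with a differently shaped dataset.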

### Evaluate the Chain

For structured data like JSON, a simple string comparison is often too strict. The order of keys in a JSON object can change without altering the meaning of the data. 

To handle this, we'll use the **`json_edit_distance`** evaluator. This is a deterministic, non-LLM-based evaluator that works as follows:
1.  **Canonicalization**: It takes both the predicted JSON and the reference JSON and standardizes them. This typically involves sorting all keys alphabetically at every level of the object.
2.  **String Conversion**: It converts the canonical JSON objects into strings.
3.  **Edit Distance**: It calculates the Damerau-Levenshtein edit distance between the two strings. This measures the number of insertions, deletions, substitutions, and transpositions required to change one string into the other.
4.  **Normalization**: The raw distance is normalized to produce a score between 0.0 (identical) and 1.0 (completely different), so lower scores indicate a closer match.
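
The steps above can be sketched in plain Python. This is a minimal illustration and not the evaluator's actual implementation: it uses the optimal string alignment variant of Damerau-Levenshtein (a simplification), and both the compact `json.dumps` settings and the division by the longer string's length are assumptions made for this sketch. The function returns a normalized distance, so 0.0 means the canonical strings are identical.

```python
import json


def canonicalize(obj):
    """Serialize with sorted keys and compact separators (assumed settings)."""
    return json.dumps(obj, separators=(",", ":"), sort_keys=True)


def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    prev2 = None
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(
                prev[j] + 1,         # deletion
                cur[j - 1] + 1,      # insertion
                prev[j - 1] + cost,  # substitution (or match)
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]


def json_edit_distance(prediction, reference):
    """Normalized distance in [0, 1]; 0.0 means identical canonical strings."""
    pred_str = canonicalize(prediction)
    ref_str = canonicalize(reference)
    longest = max(len(pred_str), len(ref_str))
    if longest == 0:
        return 0.0
    return osa_distance(pred_str, ref_str) / longest


# Key order does not matter after canonicalization.
print(json_edit_distance({"a": 1, "b": 2}, {"b": 2, "a": 1}))
# A single changed value is one substitution in the canonical string.
print(json_edit_distance({"a": 1, "b": 2}, {"a": 1, "b": 3}))
```

Because the distance is computed over characters, a long free-text field that differs slightly can dominate the score relative to a short field that is entirely wrong, which is worth keeping in mind when reading the results.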

Before running the evaluation, we'll adjust the logging level. The legal documents are very long, and if an error occurs during processing, the default logging behavior might print the entire document to the screen, cluttering the notebook. By setting the logging level to `CRITICAL`, we ensure that only the most severe errors are displayed.

In [13]:
import logging # Import the logging module to control log message verbosity.

# We will suppress any non-critical errors here since the documents are long
# and could pollute the notebook output with excessive text.
logger = logging.getLogger() # Get the root logger.
logger.setLevel(logging.CRITICAL) # Set its level to CRITICAL, so only messages of that severity or higher will be shown.

Finally, we run the evaluation using LangSmith's `evaluate` function. This function orchestrates the entire process: it takes our chain, runs it on each example from our dataset, and then applies the specified evaluator to score the results.

In [22]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate # Import the necessary evaluation functions from LangSmith.

# Call the main evaluate function to run the experiment.
res = evaluate(
    chain.invoke, # The function to be tested. `chain.invoke` is the standard way to run a chain.
    data=dataset_name, # The name of the dataset in LangSmith to run on.
    evaluators=[LangChainStringEvaluator("json_edit_distance")], # A list of evaluators to apply to the results.
    # To avoid hitting API rate limits, we can limit the number of concurrent requests.
    max_concurrency=2,
)

View the evaluation results for experiment: 'ordinary-hat-82' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/fbc1a41c-7043-4b5f-b6e8-78266faac187/compare?selectedSessions=02fbe581-47ae-4c87-bcad-7a8c44e8789b


### Analyze the Results

When reviewing the experiment in LangSmith, consider the following questions:

- Look at examples with low scores. What kind of errors is the model making? Is it missing entire sections, or just making small mistakes in fields like dates or addresses?
- Are there any examples where the model seems to hallucinate information that wasn't in the original text?
- Could the reference data in the dataset be improved? Sometimes, evaluation failures can point to ambiguities or errors in the ground truth.

### What Structured Validation Reveals
Structured evaluation is effective at detecting:
- partial compliance with a schema,
- hallucinated or missing fields,
- incorrect nesting or type violations,
- outputs that are semantically plausible but operationally invalid.
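
A field-level comparison makes several of these failure modes concrete. The sketch below walks a predicted object and a reference object in parallel and reports missing fields, extra fields, list length differences, and type mismatches. The helper `compare_structure` and its issue labels are hypothetical, and the sample records are invented for illustration; an "extra" field only signals a candidate hallucination that still needs checking against the source text.

```python
def compare_structure(predicted, reference, path=""):
    """Return a list of (issue, path) tuples describing structural differences."""
    issues = []
    if isinstance(reference, dict) and isinstance(predicted, dict):
        for key, ref_value in reference.items():
            child = f"{path}.{key}" if path else key
            if key not in predicted:
                issues.append(("missing", child))
            else:
                issues.extend(compare_structure(predicted[key], ref_value, child))
        for key in predicted:
            if key not in reference:
                # Present in the prediction only: a candidate hallucinated field.
                issues.append(("extra", f"{path}.{key}" if path else key))
    elif isinstance(reference, list) and isinstance(predicted, list):
        if len(predicted) != len(reference):
            issues.append(("length_mismatch", path))
        # Compare items pairwise up to the shorter list's length.
        for index, (pred_item, ref_item) in enumerate(zip(predicted, reference)):
            issues.extend(compare_structure(pred_item, ref_item, f"{path}[{index}]"))
    elif type(predicted) is not type(reference):
        issues.append(("type_mismatch", path))
    return issues


# Invented sample records, for illustration only.
reference = {
    "document_title": "Lease",
    "effective_date": "2020-01-01",
    "parties": [{"name": "Acme", "address": {"city": "Austin"}}],
    "sections": [],
}
predicted = {
    "document_title": "Lease",
    "effective_date": 20200101,
    "parties": [{"name": "Acme", "address": {}}],
    "governing_law": "Texas",
}

for issue, location in compare_structure(predicted, reference):
    print(issue, location)
```

Unlike a single distance score, this kind of report says where the output diverges, which helps when deciding whether a failure is a nesting problem, a type problem, or a content gap.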

These failure modes are frequently missed by semantic evaluators, which may reward fluency while overlooking violations that would cause downstream systems to fail.

### Limitations and Trade-offs
While structured validation provides clear signals about specification adherence, it does not assess whether the content of the output is correct or meaningful. A perfectly valid schema can still encode incorrect information. As such, structured evaluation should not be interpreted as a proxy for correctness, but as a complementary diagnostic focused on interface reliability.

### Role in a Broader Evaluation Framework
In this project, structured-output evaluation functions as a boundary check between generative models and deterministic systems. When combined with semantic evaluation and trajectory-level analysis, it helps localise where failures occur: whether a model misunderstood the task, violated constraints during generation, or produced a structurally valid but incorrect result.

This distinction is critical for debugging and for deciding where to intervene—prompting, decoding, post-processing, or system design.

## Discussion
As LLMs are increasingly embedded in pipelines that expect precise, machine-readable outputs, structural reliability becomes a first-order concern. This notebook demonstrates how structured validation can be used not only to catch errors, but to characterise the kinds of failures models make when translating between unstructured reasoning and formal representations.