# Introduction

LLMs generate text that can veer off-topic or contain fabricated details. This unpredictability poses a serious challenge for organizations that need to distill consistent, structured data (e.g., addresses, invoices, summaries) from large volumes of unstructured text. When most people talk about "AI" (like ChatGPT), they picture a system that generates free-form text, which can range from a coherent explanation to a complete tangent.


This often makes LLMs too unpredictable and unreliable for real-world tasks, especially in enterprise or mission-critical settings. These criticisms are valid: LLMs are probabilistic, generative systems prone to variability. Free-form text output is hard to trust if the LLM “hallucinates” or drifts off-topic.


**Structured extraction** is an approach that helps partially mitigate these concerns.

Instead of requesting open-ended text, we give the model a **blueprint**: a strictly defined schema that the output must adhere to. We specify *exactly how we want the model’s output to look*, in this case structured data in JSON. Then we use **Pydantic** (a Python library) to formally define that schema and validate the JSON output against its type definitions.

This constrained use of AI addresses one of the biggest concerns: **unpredictability**. By requiring the model to produce output that follows a strict schema, any errors become more detectable, and the output can be systematically processed. Structured extraction places a guardrail around the model:


1. Define a schema describing exactly what fields and data types are expected.
2. Instruct the LLM to return JSON conforming to that schema.
3. If the LLM returns something that doesn’t match the schema, we can immediately detect the error and handle it (e.g., ask the model again, or flag the output).
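
The three steps above can be sketched as a small retry loop. This is a minimal illustration, not a definitive implementation: `extract_with_retry` and `fake_model` are hypothetical names, and `fake_model` is a stand-in that returns canned replies so the loop can run without a real LLM.

```python
from typing import Callable

from pydantic import BaseModel, ValidationError


# Step 1: define the schema (fields and types we expect).
class Country(BaseModel):
    name: str
    capital: str
    languages: list[str]


def extract_with_retry(
    call_model: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> Country:
    """Ask the model, validate the reply, and ask again if validation fails."""
    last_error = None
    for _ in range(max_attempts):
        raw = call_model(prompt)  # Step 2: the model is asked for JSON
        try:
            # Step 3: reject anything that is not valid JSON for the schema
            return Country.model_validate_json(raw)
        except ValidationError as exc:
            last_error = exc
    raise RuntimeError(f"No valid output after {max_attempts} attempts") from last_error


# Hypothetical stand-in for an LLM: the first reply is free-form text,
# the second reply is valid JSON.
calls = []
canned_replies = iter([
    "Sure! Canada is a large country in North America.",
    '{"name": "Canada", "capital": "Ottawa", "languages": ["English", "French"]}',
])


def fake_model(prompt: str) -> str:
    calls.append(prompt)
    return next(canned_replies)


country = extract_with_retry(fake_model, "Tell me about Canada.")
print(country)
```

In a real pipeline, `call_model` would wrap the actual LLM call, and the final `RuntimeError` is the point where the output gets flagged for review.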



This approach reduces the “black-box” feeling of AI text generation. By constraining the model’s responses to a known structure, we make outputs more transparent, testable, and suitable for downstream workflows, which is precisely the kind of safeguard critics demand.

Below, we illustrate how to implement structured extraction in practice, using code samples to demonstrate each step of guiding an LLM to produce verifiable JSON responses.

## How Structured Extraction Works 

Before demonstrating how to guide a Large Language Model (LLM) toward a strict schema, it’s instructive to see what happens without any constraints. In the following example, we simply prompt the model about Canada, only limiting the number of tokens (for brevity) rather than specifying any structured output.

Here we use the Ollama Python library to call the Llama 3.1 8B model, running locally on our machine.

In [19]:
from ollama import chat
from pprint import pprint as pp

response = chat(
  messages=[
    {
      'role': 'user',
      'content': 'Tell me about Canada.',
    }
  ],
  model='llama3.1',
  options={'num_predict': 150} # Maximum number of tokens to predict when generating text
)
pp(response.message.content)

('Canada! The Great White North is a vast and beautiful country located in '
 "North America, sharing the world's longest international border with the "
 'United States to the south. Here are some interesting facts about Canada:\n'
 '\n'
 '**Geography**\n'
 '\n'
 'Canada is a large country, covering an area of approximately 10 million '
 'square kilometers (3.9 million sq mi). It has a diverse geography, featuring '
 'mountains (including the Canadian Rockies), forests, lakes, rivers, and '
 'coastlines along the Atlantic, Pacific, and Arctic Oceans. The country can '
 'be divided into several regions: Western Canada, including British Columbia '
 'and Alberta; Central Canada, comprising provinces like Ontario and Quebec; '
 'and Eastern Canada, which includes provinces like Nova Scotia and New '
 'Brunswick.\n'
 '\n'
 '**Cities**\n'
 '\n'
 'Some of Canada')


As expected, the output is unstructured descriptive text, cut off mid-sentence by the 150-token limit.


### From Free-Form to Structured

Now we build on the example above to show how an LLM can be guided to produce structured output that follows a schema. To replace the open-ended style with a structured extraction approach, we’ll use Pydantic to define a schema (i.e., which fields we expect and in what format) and then instruct the LLM to output only valid JSON matching that schema.

#### Defining a Pydantic Model

In [1]:
from pydantic import BaseModel

class Country(BaseModel):
  name: str
  capital: str
  languages: list[str]

Pydantic will automatically generate a JSON Schema to describe the fields. This schema will look like:

In [3]:
Country.model_json_schema()

{'properties': {'name': {'title': 'Name', 'type': 'string'},
  'capital': {'title': 'Capital', 'type': 'string'},
  'languages': {'items': {'type': 'string'},
   'title': 'Languages',
   'type': 'array'}},
 'required': ['name', 'capital', 'languages'],
 'title': 'Country',
 'type': 'object'}

### Sending the Schema to the LLM

When we call the LLM, we include this JSON schema in the request so that the model knows **what** to return and **how** to format it. For instance:

In [4]:
from ollama import chat

response = chat(
  messages=[
    {
      'role': 'user',
      'content': 'Tell me about Canada.',
    }
  ],
  model='llama3.1',
  format=Country.model_json_schema(),
)

This tells the model: “Return your answer as valid JSON that matches the schema for Country.” Instead of a free-form reply, the model is instructed to produce something like:

In [5]:
response.message.content

'{ "name" : "Canada", "capital" : "Ottawa", "languages" : ["English", "French"] }'

### Validating the Output
Once the model responds, we validate the JSON to confirm it meets the schema:

In [6]:
country = Country.model_validate_json(response.message.content)
print(country)

name='Canada' capital='Ottawa' languages=['English', 'French']


If the output fails to match the schema (for example, the LLM omits the languages field), Pydantic raises an error. This immediate feedback loop is crucial for reliability. **We’re not simply trusting the AI to always comply; we’re enforcing compliance.**
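
As a minimal sketch of that failure case, the snippet below validates a hand-written reply that omits `languages`. The `bad_json` string and the `try_parse` helper are hypothetical and only serve the illustration; no model is called.

```python
from pydantic import BaseModel, ValidationError


class Country(BaseModel):
    name: str
    capital: str
    languages: list[str]


# Hand-written reply that is valid JSON but is missing the `languages` field.
bad_json = '{"name": "Canada", "capital": "Ottawa"}'


def try_parse(raw: str):
    """Return (country, problems); problems lists (field path, error type) pairs."""
    try:
        return Country.model_validate_json(raw), []
    except ValidationError as exc:
        return None, [(err["loc"], err["type"]) for err in exc.errors()]


country, problems = try_parse(bad_json)
for loc, kind in problems:
    print(f"{'.'.join(str(part) for part in loc)}: {kind}")
```

Each entry in `exc.errors()` names the offending field and the kind of failure, which is enough to log the problem or to build a follow-up prompt for the model.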

## Concrete Example: Addresses
Here’s a more realistic use case: parsing physical addresses from a piece of text. We define two Pydantic models, Address and Addresses:

In [7]:
from pydantic import BaseModel

class Address(BaseModel):
    name: str
    street_number: str
    street_name: str
    city: str

class Addresses(BaseModel):
    addresses: list[Address]

Let’s imagine the user has a piece of text describing several locations:

In [10]:
data =  """
The Ottawa Public Library is at 150 Elgin Street, Ottawa.
Down the street, Sarah Wilson runs her bakery at 240 Laurier Avenue, Ottawa.
Over in Kanata, Tech Corp's office is at 1385 Terry Fox Drive.
"""

We pass a system prompt and this text to the LLM. The Llama 3.1 model uses the system message to understand its role as a data parsing assistant and generates a response that adheres to the supplied JSON schema.

In [13]:
prompt = """
You are a data parsing assistant. 
User provides unstructured data containing addresses. 
Your goal is to output it as JSON.
"""
response = chat(
    model='llama3.1:8b',
    messages=[
        {"role": "system", "content": prompt},
        {"role": "user", "content": data}
    ],
    format=Addresses.model_json_schema(),  # JSON schema generated by Pydantic from the Addresses model
    options={'temperature': 0},  # Temperature 0 makes responses as repeatable as possible
)

This is the raw output from the model:

In [14]:
pp(response.message.content)

('{\n'
 '  "addresses": [\n'
 '    {\n'
 '      "name": "Ottawa Public Library",\n'
 '      "street_number": "150",\n'
 '      "street_name": "Elgin Street",\n'
 '      "city": "Ottawa"\n'
 '    },\n'
 '    {\n'
 '      "name": "Sarah Wilson\'s Bakery",\n'
 '      "street_number": "240",\n'
 '      "street_name": "Laurier Avenue",\n'
 '      "city": "Ottawa"\n'
 '    },\n'
 '    {\n'
 '      "name": "Tech Corp Office",\n'
 '      "street_number": "1385",\n'
 '      "street_name": "Terry Fox Drive",\n'
 '      "city": "Kanata"\n'
 '    }\n'
 '  ]\n'
 '}')


After validating the JSON against our Addresses model, we can convert it into a pandas DataFrame. 

In [17]:
# Use Pydantic to validate the response
address_output = Addresses.model_validate_json(response.message.content)
pp(address_output)

Addresses(addresses=[Address(name='Ottawa Public Library', street_number='150', street_name='Elgin Street', city='Ottawa'), Address(name="Sarah Wilson's Bakery", street_number='240', street_name='Laurier Avenue', city='Ottawa'), Address(name='Tech Corp Office', street_number='1385', street_name='Terry Fox Drive', city='Kanata')])


In [18]:
import pandas as pd

def locations_to_df(addresses: Addresses) -> pd.DataFrame:
    """
    Convert the Addresses Pydantic model to a pandas DataFrame.
    Each row represents one address with all its fields.
    """
    # Convert each Address to a dict and build the DataFrame from the list
    return pd.DataFrame([loc.model_dump() for loc in addresses.addresses])

# Convert the validated response from the previous cell
df = locations_to_df(address_output)
df

| | name | street_number | street_name | city |
|---|---|---|---|---|
| 0 | Ottawa Public Library | 150 | Elgin Street | Ottawa |
| 1 | Sarah Wilson's Bakery | 240 | Laurier Avenue | Ottawa |
| 2 | Tech Corp Office | 1385 | Terry Fox Drive | Kanata |


## Conclusion

Many LLM APIs now support structured outputs, making it possible to constrain a model's output to a specific format defined by a JSON schema.
**Structured extraction** is a process that transforms **unstructured text** into a **structured format**, such as JSON, making it easily accessible for further processing, analysis, or storage.

Structured outputs have many use cases, including:
1. Parsing data from documents
2. Extracting data from images
3. Structuring all language model responses

With the rise of LLMs, this task can now be accomplished efficiently and affordably, enabling enterprises to unlock valuable insights from large volumes of unstructured data, including PDFs, text files, and scanned documents.



In [3]:
import os
from openai import OpenAI
from openai.types.chat import ChatCompletion  


def eval(prompt: str, message: str, model: str = "gpt-4o") -> ChatCompletion:
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    messages = [
        {"role": "system", "content": prompt},
        {"role": "user", "content": message},
    ]

    return client.chat.completions.create(
        model=model,
        messages=messages
    )



res = eval(prompt=prompt, message=data)
json_data = res.choices[0].message.content

print(json_data)

```json
[
    {
        "name": "Ottawa Public Library",
        "address": "150 Elgin Street",
        "city": "Ottawa"
    },
    {
        "name": "Sarah Wilson's Bakery",
        "address": "240 Laurier Avenue",
        "city": "Ottawa"
    },
    {
        "name": "Tech Corp's office",
        "address": "1385 Terry Fox Drive",
        "city": "Kanata"
    }
]
```


We can see that the model didn't return plain JSON; it returned a Markdown-formatted string containing JSON. The reason is that we didn't request JSON output in the API call.
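
When an API offers no JSON option, a fenced reply like the one above can still be recovered by stripping the Markdown fence before parsing. The helper below is a hedged sketch using only the standard library; `parse_fenced_json` is a hypothetical name, and it assumes the whole reply is a single fenced block (or bare JSON), nothing more.

```python
import json
import re

# Matches a reply that is one Markdown code fence, optionally tagged "json".
# The fence is written as `{3} so this example does not contain a literal fence.
FENCE_RE = re.compile(r"^`{3}(?:json)?\s*\n(.*?)\n`{3}\s*$", re.DOTALL)


def parse_fenced_json(reply: str):
    """Parse JSON from a reply that may be wrapped in a Markdown code fence."""
    text = reply.strip()
    match = FENCE_RE.match(text)
    if match:
        text = match.group(1)  # keep only what is inside the fence
    return json.loads(text)    # raises json.JSONDecodeError on anything else


fence = "`" * 3
fenced_reply = (
    f"{fence}json\n"
    '[{"name": "Ottawa Public Library", "address": "150 Elgin Street", "city": "Ottawa"}]\n'
    f"{fence}"
)

records = parse_fenced_json(fenced_reply)
print(records[0]["name"])
```

This kind of post-processing is brittle, which is why asking the API for JSON directly is the better option when it is available.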

In [4]:
def eval(prompt: str, message: str, model: str = "gpt-4o") -> ChatCompletion:
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    messages = [
        {"role": "system", "content": prompt},
        {"role": "user", "content": message},
    ]

    return client.chat.completions.create(
        model=model,
        messages=messages,
        # Enable JSON mode (the reply must be a JSON object)
        response_format={"type": "json_object"},
    )
prompt = """
You are a data parsing assistant. 
User provides unstructured data containing addresses. 
Your goal is to output it as JSON.
"""
data =  """
The Ottawa Public Library is at 150 Elgin Street, Ottawa.
Down the street, Sarah Wilson runs her bakery at 240 Laurier Avenue, Ottawa.
Over in Kanata, Tech Corp's office is at 1385 Terry Fox Drive.
"""


res = eval(prompt=prompt, message=data)
json_data = res.choices[0].message.content

print(json_data)

{
  "addresses": [
    {
      "name": "Ottawa Public Library",
      "address": "150 Elgin Street",
      "city": "Ottawa"
    },
    {
      "name": "Sarah Wilson's Bakery",
      "address": "240 Laurier Avenue",
      "city": "Ottawa"
    },
    {
      "name": "Tech Corp",
      "address": "1385 Terry Fox Drive",
      "city": "Kanata"
    }
  ]
}


Now, running the same code returns plain JSON. This is great not only because we don't need to parse anything extra, but also because JSON mode keeps the LLM from wrapping the data in free-form text such as "Sure, here is your data! {}".


**The problem is that we haven't defined the shape of the data; let's call it the *schema*. The schema is now up to the LLM, and it might change based on user input.** Let's reformat the data to see this in action.

In [5]:
# Messy web-scraped format with typical HTML artifacts and inconsistent formatting
data_2 = """
[Search Results]
* Ottawa Public Library *
Contact Us > Main Branch
Located at: &nbsp;150 Elgin Street, Ottawa
Status: OPEN NOW! 📚
Customer Service: library@ottawa.ca
Visit us online: https://biblioottawalibrary.ca
Postal: K1P 1L7
---------------------
<div class="business-listing">
Sarah's Bakery & Café [⭐️4.8]
Address line 1: 240 
Address line 2: Laurier Avenue
City: Ottawa
Postal Code: K1P 5J7
Contact: sarah@sarahsbakery.ca
www.sarahsbakery.ca
</div>
...Read More...
---------------------
TECH CORP GLOBAL
www.techcorp.com/contact
📍 1385 Terry Fox Drive
Kanata, Ontario
[Click to view map]
Email: info@techcorp.com
Postal: K2K 3K2
"""

res = eval(prompt=prompt, message=data_2)
json_data = res.choices[0].message.content

print(json_data)

{
    "results": [
        {
            "name": "Ottawa Public Library",
            "branch": "Main Branch",
            "address": {
                "street": "150 Elgin Street",
                "city": "Ottawa",
                "postal_code": "K1P 1L7"
            },
            "contact": {
                "email": "library@ottawa.ca",
                "website": "https://biblioottawalibrary.ca"
            },
            "status": "Open Now"
        },
        {
            "name": "Sarah's Bakery & Café",
            "address": {
                "line_1": "240 Laurier Avenue",
                "city": "Ottawa",
                "postal_code": "K1P 5J7"
            },
            "contact": {
                "email": "sarah@sarahsbakery.ca",
                "website": "www.sarahsbakery.ca"
            },
            "rating": "4.8"
        },
        {
            "name": "TECH CORP GLOBAL",
            "address": {
                "street": "1385 Terry Fox Drive",
                "
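
The drift between the two replies can also be checked mechanically. The sketch below reduces a JSON value to its "shape" (key names and value types) and compares the shapes. It uses only the first record of each reply shown above, copied by hand; the `shape` helper is a hypothetical name introduced for this illustration.

```python
import json

# First record of the reply for the clean input (copied from the output above).
first_reply = """
{"addresses": [{"name": "Ottawa Public Library",
                "address": "150 Elgin Street",
                "city": "Ottawa"}]}
"""

# First record of the reply for the messy web-scraped input.
second_reply = """
{"results": [{"name": "Ottawa Public Library",
              "branch": "Main Branch",
              "address": {"street": "150 Elgin Street",
                          "city": "Ottawa",
                          "postal_code": "K1P 1L7"},
              "contact": {"email": "library@ottawa.ca",
                          "website": "https://biblioottawalibrary.ca"},
              "status": "Open Now"}]}
"""


def shape(value):
    """Describe a JSON value by its keys and value types, ignoring the data."""
    if isinstance(value, dict):
        return {key: shape(item) for key, item in value.items()}
    if isinstance(value, list):
        # Use the first element as representative of the list's item shape.
        return [shape(value[0])] if value else []
    return type(value).__name__


first_shape = shape(json.loads(first_reply))
second_shape = shape(json.loads(second_reply))

print(first_shape)
print(second_shape)
print("same shape:", first_shape == second_shape)
```

Code written against the first shape (a top-level `addresses` list with a flat `address` string) would break on the second, which is the motivation for enforcing a schema.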

## Enforcing a Strict Schema (Pydantic Example)
Usually, raw JSON output won't cut it in software systems. It's just a string, after all. We have to ensure that the LLM actually returns correctly formed data.

To robustly handle data after extracting it, you often need a well-defined schema. Tools like Pydantic are excellent for validating that the JSON from the LLM matches the expected structure. The model below uses the following building blocks:

- **BaseModel**: Pydantic's base class that enables data validation
- **Optional**: From Python's typing module, marks fields that may be `None`; combined with a `None` default, they aren't required
- **EmailStr**: Pydantic's email validator, ensures valid email format (requires the `email-validator` package)
- **HttpUrl**: Pydantic's URL validator, ensures valid website format

This model will:

1. Ensure required fields (name, street info, city) are present
2. Allow optional fields to be missing
3. Validate email format if provided
4. Validate URL format if provided

In [8]:
from pydantic import BaseModel, EmailStr, HttpUrl
from typing import Optional

class BusinessLocation(BaseModel):
    name: str
    street_number: str
    street_name: str
    city: str
    postal_code: Optional[str] = None
    email: Optional[EmailStr] = None
    website: Optional[HttpUrl] = None
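
To see these rules in action without calling an LLM, the sketch below validates a few hand-written records. It is an illustration under stated assumptions: `BusinessLocationSketch` is a hypothetical, trimmed copy of the model above that leaves out the `EmailStr` field (so the example runs without the optional `email-validator` package), and `failing_fields` is a hypothetical helper.

```python
from typing import Optional

from pydantic import BaseModel, HttpUrl, ValidationError


class BusinessLocationSketch(BaseModel):
    name: str
    street_number: str
    street_name: str
    city: str
    postal_code: Optional[str] = None
    website: Optional[HttpUrl] = None


def failing_fields(payload: dict) -> list[str]:
    """Return the names of the fields that fail validation (empty if none)."""
    try:
        BusinessLocationSketch(**payload)
    except ValidationError as exc:
        return [str(err["loc"][0]) for err in exc.errors()]
    return []


# Optional fields may simply be left out; they default to None.
library = BusinessLocationSketch(
    name="Ottawa Public Library",
    street_number="150",
    street_name="Elgin Street",
    city="Ottawa",
    website="https://biblioottawalibrary.ca",
)

# A required field is missing.
missing_city = failing_fields(
    {"name": "TECH CORP GLOBAL", "street_number": "1385", "street_name": "Terry Fox Drive"}
)

# A website without a scheme (as it appears in the scraped text) is not a valid HttpUrl.
no_scheme = failing_fields(
    {
        "name": "Sarah's Bakery & Café",
        "street_number": "240",
        "street_name": "Laurier Avenue",
        "city": "Ottawa",
        "website": "www.sarahsbakery.ca",
    }
)

print(library.postal_code, missing_city, no_scheme)
```

The last case is worth noting: strict field types catch real inconsistencies in the source text, so the prompt or a pre-processing step may need to normalize values such as URLs before validation.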

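In [6]:
def eval(prompt: str, message: str, response_model: type[BaseModel], model: str = "gpt-4o") -> ChatCompletion:
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    messages = [
        {"role": "system", "content": prompt},
        {"role": "user", "content": message},
    ]

    # parse() accepts a Pydantic model as response_format and exposes the parsed
    # object on message.parsed (assumes an openai SDK version that provides
    # client.beta.chat.completions.parse)
    return client.beta.chat.completions.parse(
        model=model,
        messages=messages,
        # Enable structured output
        response_format=response_model,
    )

In [None]:
res = eval(prompt=prompt, message=data_2, response_model=BusinessLocation)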

In [1]:
from datetime import datetime
from pydantic import BaseModel, constr

class Business(BaseModel):
    id: int
    business_name: str = 'Unregistered Business'
    registration_ts: datetime | None
    business_number: constr(pattern=r'^\d{9}$')  # 9-digit pattern
    naics_code: constr(pattern=r'^\d{6}$')       # 6-digit pattern

In [2]:
external_data = {
    'id': 123,
    'business_name': 'Hanan Ather Pharmacy',
    'registration_ts': '2023-01-15 09:30',
    'business_number': '123456789',
    'naics_code': '446110'  # Actual NAICS code for pharmacies
}

business = Business(**external_data)
pp(business.model_dump())  # Convert the model to a dictionary

{'id': 123, 'business_name': 'Hanan Ather Pharmacy', 'registration_ts': datetime.datetime(2023, 1, 15, 9, 30), 'business_number': '123456789', 'naics_code': '446110'}


If the LLM ever returns incorrectly shaped data (e.g., the `id` is missing or the `business_number` is only 8 digits), a `ValidationError` will be raised. This helps ensure you don’t inadvertently store invalid or incomplete data.

