## Structured Extraction

**Structured extraction** is a process that transforms **unstructured text** into a **structured format**, such as JSON, making it easily accessible for further processing, analysis, or storage. 


To build intuition, let's implement a natural language processing parser that creates structured tabular data from unstructured data. We can accomplish this in three steps:
1. Feed unstructured text into an LLM
2. Receive structured JSON output
3. Process the JSON into desired formats (e.g., dataframes)
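
The three steps can be sketched end to end without a running model. In this minimal sketch, `fake_llm_reply` is a hard-coded stand-in for the JSON an LLM might return (an assumption for illustration, not real model output), and the standard-library `json` module does the parsing.

```python
import json

# Step 1 (stand-in): pretend an LLM turned free text into this JSON string.
# The string is hard-coded for illustration; it is not real model output.
fake_llm_reply = (
    '{"addresses": ['
    '{"name": "Ottawa Public Library", "city": "Ottawa"}, '
    '{"name": "Tech Corp Office", "city": "Kanata"}'
    "]}"
)

# Step 2: parse the JSON text into Python objects.
parsed = json.loads(fake_llm_reply)

# Step 3: reshape into table-like rows (one dict per row, keys as columns).
rows = parsed["addresses"]
columns = list(rows[0].keys())

print(columns)
for row in rows:
    print([row[col] for col in columns])
```

The rest of this section replaces the hard-coded string with a real model call and the manual parsing with Pydantic validation.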

Structured extraction is built around **Pydantic** classes: we define the target data structure as a Pydantic model, generate a JSON schema from it, and use the model to validate the LLM's output and coerce it into that structure.

The schema is the crucial piece: it is passed to the LLM, which uses it as instructions on how to return the data.


With the rise of LLMs, this task can now be accomplished efficiently and affordably, enabling enterprises to unlock valuable insights from large volumes of unstructured data, including PDFs, text files, and scanned documents.



The `chat` function from the `ollama` Python library takes a list of messages, typically including a **system message** and a **user message**. The system message (your prompt) sets the context and instructions for the model, while the user message contains the actual data to be processed.
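
As a small illustration of that message structure, the sketch below builds the list by hand. `build_messages` is a hypothetical helper introduced here for illustration; it is not part of the `ollama` library.

```python
def build_messages(system_prompt: str, user_text: str) -> list[dict[str, str]]:
    """Return a two-message conversation: instructions first, then the data."""
    return [
        {"role": "system", "content": system_prompt},  # context and instructions
        {"role": "user", "content": user_text},        # the data to process
    ]


messages = build_messages(
    "You are a data parsing assistant. Output JSON.",
    "The Ottawa Public Library is at 150 Elgin Street, Ottawa.",
)
for message in messages:
    print(message["role"], "->", message["content"])
```

The resulting list has the shape that the `messages` argument of `chat` expects in the cells below.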

In [29]:
from ollama import chat
from pprint import pprint as pp

response = chat(
  messages=[
    {
      'role': 'user',
      'content': 'Tell me about Canada.',
    }
  ],
  model='llama3.1',
  options={'num_predict': 200} # Maximum number of tokens to predict when generating text
)
pp(response.message.content)

('Canada! Our neighbor to the north is a vast and beautiful country with a '
 "rich history, diverse culture, and breathtaking landscapes. Here's an "
 'overview:\n'
 '\n'
 '**Geography**\n'
 '\n'
 'Canada spans across the continent of North America, sharing borders with the '
 'United States (to the south), Alaska (to the northwest), and the Atlantic '
 'Ocean (to the east). It has a total area of approximately 10 million square '
 'kilometers (3.9 million sq mi), making it one of the largest countries in '
 'the world by land area.\n'
 '\n'
 'Canada is home to diverse landscapes, including:\n'
 '\n'
 '* Mountains: The Canadian Rockies, the Torngats, and the Appalachian '
 'Mountains\n'
 '* Rivers: The St. Lawrence, the Mackenzie, and the Fraser River\n'
 '* Coastlines: Canada has a long coastline along the Atlantic Ocean, as well '
 'as extensive coastlines on Hudson Bay, James Bay, and the Pacific Ocean\n'
 '\n'
 '**Climate**\n'
 '\n'
 "Canada's climate varies greatly depending on t

The `format` parameter allows you to specify a JSON schema that defines the structure of the desired output.  In our case, we are using a **Pydantic model** to generate this schema automatically. This tells the model exactly how to format its response.

In [31]:
from pydantic import BaseModel
class Country(BaseModel):
  name: str
  capital: str
  languages: list[str]

pp(Country.model_json_schema())

{'properties': {'capital': {'title': 'Capital', 'type': 'string'},
                'languages': {'items': {'type': 'string'},
                              'title': 'Languages',
                              'type': 'array'},
                'name': {'title': 'Name', 'type': 'string'}},
 'required': ['name', 'capital', 'languages'],
 'title': 'Country',
 'type': 'object'}
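
The `required` list in the schema follows from how the fields are declared. As a hedged sketch (assuming Pydantic v2, and using a hypothetical `population` field that is not part of the original model), giving a field a default value removes it from `required`, which can be useful when the source text may not mention every field:

```python
from typing import Optional

from pydantic import BaseModel


class CountryWithOptional(BaseModel):
    name: str
    capital: str
    # Hypothetical extra field: optional, so the model may omit it
    population: Optional[int] = None


schema = CountryWithOptional.model_json_schema()
print(schema["required"])
print(schema["properties"]["population"])
```

Whether a given LLM honors optional fields reliably is not something this sketch establishes; it only shows how the declaration changes the schema that gets sent.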


In [27]:
from ollama import chat
from pydantic import BaseModel

class Country(BaseModel):
  name: str
  capital: str
  languages: list[str]

response = chat(
  messages=[
    {
      'role': 'user',
      'content': 'Tell me about Canada.',
    }
  ],
  model='llama3.1',
  format=Country.model_json_schema(),
)

country = Country.model_validate_json(response.message.content)
print(country)

name='Canadá' capital='Ottawa' languages=['English', 'French']
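
`model_validate_json` does more than parse: it raises a `ValidationError` when the reply does not match the model. The sketch below (assuming Pydantic v2) uses two hard-coded JSON strings in place of model output to show one reply that validates and one that does not.

```python
from pydantic import BaseModel, ValidationError


class Country(BaseModel):
    name: str
    capital: str
    languages: list[str]


# Hard-coded stand-ins for LLM replies (illustrative, not real model output)
good_reply = '{"name": "Canada", "capital": "Ottawa", "languages": ["English", "French"]}'
bad_reply = '{"name": "Canada", "languages": "English"}'  # no capital; languages is not a list

country = Country.model_validate_json(good_reply)
print(country.capital)

errors = []
try:
    Country.model_validate_json(bad_reply)
except ValidationError as exc:
    # Each error records which field failed and why
    errors = exc.errors()
    for err in errors:
        print(err["loc"], err["type"])
```

Catching the error gives you a place to log the bad reply or retry the request instead of passing malformed data downstream.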


The Llama 3.1 model processes the input messages and generates a response that adheres to the specified JSON schema. It uses the context from the system message to understand its role as a data parsing assistant and formats the extracted information according to the provided schema.

In [11]:
from pprint import pprint as pp

from ollama import chat
from pydantic import BaseModel

prompt = """
You are a data parsing assistant. 
User provides unstructured data containing addresses.
Your goal is to output it as JSON.
"""
data =  """
The Ottawa Public Library is at 150 Elgin Street, Ottawa.
Down the street, Sarah Wilson runs her bakery at 240 Laurier Avenue, Ottawa.
Over in Kanata, Tech Corp's office is at 1385 Terry Fox Drive.
"""

class Address(BaseModel):
    name: str
    street_number: str
    street_name: str
    city: str


class Addresses(BaseModel):
    addresses: list[Address]

In [18]:
pp(Addresses.model_json_schema())

{'$defs': {'Address': {'properties': {'city': {'title': 'City',
                                               'type': 'string'},
                                      'name': {'title': 'Name',
                                               'type': 'string'},
                                      'street_name': {'title': 'Street Name',
                                                      'type': 'string'},
                                      'street_number': {'title': 'Street '
                                                                 'Number',
                                                        'type': 'string'}},
                       'required': ['name',
                                    'street_number',
                                    'street_name',
                                    'city'],
                       'title': 'Address',
                       'type': 'object'}},
 'properties': {'addresses': {'items': {'$ref': '#/$defs/Address'},
                

In [9]:
response = chat(
    model='llama3.1:8b',
    messages=[
        {"role": "system", "content": prompt},
        {"role": "user", "content": data},
    ],
    format=Addresses.model_json_schema(),  # Schema generated by Pydantic (a hand-written schema dict works too)
    options={'temperature': 0},  # Make responses more deterministic
)

In [17]:
pp(response.message.content)

('{\n'
 '  "addresses": [\n'
 '    {\n'
 '      "name": "Ottawa Public Library",\n'
 '      "street_number": "150",\n'
 '      "street_name": "Elgin Street",\n'
 '      "city": "Ottawa"\n'
 '    },\n'
 '    {\n'
 '      "name": "Sarah Wilson\'s Bakery",\n'
 '      "street_number": "240",\n'
 '      "street_name": "Laurier Avenue",\n'
 '      "city": "Ottawa"\n'
 '    },\n'
 '    {\n'
 '      "name": "Tech Corp Office",\n'
 '      "street_number": "1385",\n'
 '      "street_name": "Terry Fox Drive",\n'
 '      "city": "Kanata"\n'
 '    }\n'
 '  ]\n'
 '}')


In [20]:
# Use Pydantic to validate the response
address_output = Addresses.model_validate_json(response.message.content)
pp(address_output)

Addresses(addresses=[Address(name='Ottawa Public Library', street_number='150', street_name='Elgin Street', city='Ottawa'), Address(name="Sarah Wilson's Bakery", street_number='240', street_name='Laurier Avenue', city='Ottawa'), Address(name='Tech Corp Office', street_number='1385', street_name='Terry Fox Drive', city='Kanata')])


In [21]:
import pandas as pd

def locations_to_df(addresses: Addresses) -> pd.DataFrame:
    """
    Convert an Addresses Pydantic model to a pandas DataFrame.
    Each row represents one address with all its fields.
    """
    # Convert each Address to a dict and build the DataFrame from the list
    return pd.DataFrame([loc.model_dump() for loc in addresses.addresses])

# Convert the validated response from the previous cell
df = locations_to_df(address_output)
df

| | name | street_number | street_name | city |
|---|---|---|---|---|
| 0 | Ottawa Public Library | 150 | Elgin Street | Ottawa |
| 1 | Sarah Wilson's Bakery | 240 | Laurier Avenue | Ottawa |
| 2 | Tech Corp Office | 1385 | Terry Fox Drive | Kanata |
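
The cells above can be folded into one reusable function. This is a sketch under stated assumptions, not a definitive implementation: `extract` and `fake_chat` are hypothetical names, and the chat function is passed in as a parameter so the example runs without a model server. `fake_chat` returns a hard-coded reply shaped like the `response.message.content` used earlier.

```python
from types import SimpleNamespace
from typing import Any, Callable, Type, TypeVar

from pydantic import BaseModel

ModelT = TypeVar("ModelT", bound=BaseModel)


class Address(BaseModel):
    name: str
    street_number: str
    street_name: str
    city: str


class Addresses(BaseModel):
    addresses: list[Address]


def extract(
    text: str,
    system_prompt: str,
    schema_model: Type[ModelT],
    chat_fn: Callable[..., Any],
    model: str = "llama3.1",
) -> ModelT:
    """Send text to a chat function and validate the reply against schema_model."""
    response = chat_fn(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text},
        ],
        format=schema_model.model_json_schema(),
        options={"temperature": 0},
    )
    return schema_model.model_validate_json(response.message.content)


def fake_chat(**kwargs: Any) -> SimpleNamespace:
    """Hypothetical stand-in for a chat call; ignores its input and returns a fixed reply."""
    reply = (
        '{"addresses": [{"name": "Ottawa Public Library", "street_number": "150", '
        '"street_name": "Elgin Street", "city": "Ottawa"}]}'
    )
    return SimpleNamespace(message=SimpleNamespace(content=reply))


result = extract(
    "The Ottawa Public Library is at 150 Elgin Street, Ottawa.",
    "You are a data parsing assistant. Output JSON.",
    Addresses,
    chat_fn=fake_chat,
)
print(result.addresses[0])
```

To run against a real model, pass `chat` from `ollama` as `chat_fn` in place of `fake_chat`. Injecting the function also makes the extraction logic easy to test in isolation.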


As mentioned, LLMs use the JSON schemas generated by Pydantic as instructions on how to return data. To improve the accuracy of the returned data, it helps to include natural-language descriptions of objects and fields and what they are used for. Pydantic supports this through class docstrings and `Field` descriptions.

from pydantic import BaseModel, Field
from typing import List

class Address(BaseModel):
    """A representation of a physical address with associated business/person name."""
    
    name: str = Field(
        description="Name of the business or person associated with this address"
    )
    street_number: str = Field(
        description="Building number or unit identifier"
    )
    street_name: str = Field(
        description="Name of the street, including type (Street, Avenue, Boulevard, etc.)"
    )
    city: str = Field(
        description="Name of the city where this address is located"
    )

class Addresses(BaseModel):
    """A collection of physical addresses, typically extracted from text."""
    
    addresses: List[Address] = Field(
        description="List of addresses extracted from the source text"
    )

This expands to a much more complex JSON schema:

In [24]:
pp(Addresses.model_json_schema())

{'$defs': {'Address': {'description': 'A representation of a physical address '
                                      'with associated business/person name.',
                       'properties': {'city': {'description': 'Name of the '
                                                              'city where this '
                                                              'address is '
                                                              'located',
                                               'title': 'City',
                                               'type': 'string'},
                                      'name': {'description': 'Name of the '
                                                              'business or '
                                                              'person '
                                                              'associated with '
                                                              'this address',
                   

When an LLM receives this schema, it has much more context about what each field represents and how it should be used, which can improve the accuracy of the data it returns.
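
Because the full schema is verbose, it can help to pull out just the descriptions and confirm they reached the schema. The sketch below assumes Pydantic v2 and uses a shortened two-field version of the `Address` model for brevity.

```python
from pydantic import BaseModel, Field


class Address(BaseModel):
    """A physical address with an associated name."""

    name: str = Field(description="Name of the business or person at this address")
    city: str = Field(description="Name of the city where this address is located")


schema = Address.model_json_schema()

# The class docstring becomes the top-level description
print(schema["description"])

# Each Field description is attached to its property
field_descriptions = {
    field: spec.get("description") for field, spec in schema["properties"].items()
}
for field, description in field_descriptions.items():
    print(f"{field}: {description}")
```

A field that prints `None` here has no description, which is a quick way to spot fields the model will have to interpret from the name alone.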
