# Structured information extraction with a small LLM

In this notebook we will use an alternative, more standard approach: simply prompting the LLM to extract the desired information. Later on, we will define an output schema with Pydantic so that the output is structured in a well-defined way.

### Insurance Case Data

First we'll load the same insurance claim reports as in the previous notebook and look at the first claim report:

In [1]:
import json
# Load cases
cases_jsonl = '../testing/claims/insurance_claim_reports.jsonl'
cases = []

# Open the JSONL file and load each line as a dictionary
with open(cases_jsonl, 'r', encoding='utf-8') as file:
    for line in file:
        cases.append(json.loads(line.strip()))

In [2]:
print(cases[0]['description'])


        Dear Parasol Insurance,

        My name is Eric Cline, and I am writing to file a claim for a recent car accident that occurred on 2024-01-06, 
        at approximately 6:30 PM. My policy number is BC-857143475.

        The accident took place at the intersection of Elm Ln and Pine Blvd. I was driving my vehicle, a yellow Ford Traverse with license plate 
        number 614 7962. At the same time, another vehicle, a blue BMW Civic with license plate number 094-RGL, 
        collided with my car. The driver, Samantha Juarez, failed to adhere to traffic rules, resulting in damage to both vehicles.

        I promptly exchanged information with the other driver and took photos of the accident scene, including damages to both vehicles.
        Attached to this email are the photos, a copy of the police report, and the estimate for the repair costs.

        Kindly assist in processing this claim and let me know the next steps. You can reach me at 242.261.8544 or bairddennis@vazq

### Framework Setup

Now we'll set up Agno again and point it to our LLM deployed with Ollama.

In [3]:
from agno.agent import Agent
from agno.models.ollama import Ollama

In [4]:
OLLAMA_HOST="localhost"
#OLLAMA_HOST="semantic-sonnenschirm-predictor.demo.svc.cluster.local"

In [5]:
# Create an Ollama client (no request is sent to the server until a method is called)
from ollama import Client
client = Client(
  host='http://'+OLLAMA_HOST+':11434',
)

### Extraction by prompting

First we'll define a simple agent that extracts information.

In [6]:
simple_agent = Agent(
    model=Ollama(id="granite3.3", host=OLLAMA_HOST),
    description="You extract information.",
)

Now we'll just prompt the agent to extract the desired information:

In [7]:
result = simple_agent.run("Extract the customer name, policy number, telephone number, email address and address from the following text: " + cases[0]['description'])

In [8]:
print(result.content)

1. Customer Name: Eric Cline
2. Policy Number: BC-857143475
3. Telephone Number: 242.261.8544
4. Email Address: bairddennis@vazquez.com
5. Address: 0801 Lauren Alley, East William, MS 43056


As you can see, this time the LLM was even able to extract the customer address. However, the output is plain, unstructured text, and different LLMs will format it in slightly different ways.
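To illustrate why free-text output is awkward to consume downstream, here is a minimal sketch that parses a numbered list like the one above with a regular expression. The `parse_numbered_fields` helper is hypothetical (it is not part of Agno or Ollama), and it assumes the model keeps exactly the `N. Label: value` layout:

```python
import re

# Hypothetical helper, for illustration only: parse lines of the form
# "1. Label: value" into a dictionary keyed by a snake_case label.
def parse_numbered_fields(text: str) -> dict[str, str]:
    pattern = re.compile(r"^\s*\d+\.\s*([^:]+):\s*(.+?)\s*$")
    fields = {}
    for line in text.splitlines():
        match = pattern.match(line)
        if match:
            label, value = match.groups()
            fields[label.strip().lower().replace(" ", "_")] = value
    return fields

# The first three lines of the agent output shown above
sample_output = """1. Customer Name: Eric Cline
2. Policy Number: BC-857143475
3. Telephone Number: 242.261.8544"""

parsed = parse_numbered_fields(sample_output)
print(parsed)

# The same information in a different (assumed) layout is silently missed
other_layout = "- **Customer Name** - Eric Cline"
print(parse_numbered_fields(other_layout))
```

A parser like this has to be adapted whenever the model changes its formatting, which is the motivation for asking for structured output instead.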

### Structured Extraction with Pydantic

Now we will use Pydantic to define a Customer class that inherits from BaseModel and describes which information we want to extract and in what form.

In [9]:
from pydantic import BaseModel, Field

In [10]:
class Customer(BaseModel):
    name: str = Field(..., description="Name of customer")
    policy_number: str = Field(..., description="Policy Number")
    telephone_number: str = Field(..., description="Telephone Number")
    email_address: str = Field(..., description="Email Address")
    address: str = Field(..., description="Customer Address")

For convenience, all of our fields are strings. To help the LLM understand what we want to extract, we annotate each field with a description.
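To see what these annotations amount to, the following sketch (assuming Pydantic v2) prints the field descriptions from the JSON schema that Pydantic generates for the model and validates two JSON strings against it. The class is repeated so the snippet runs on its own; how Agno passes the schema to the model internally is not covered here.

```python
from pydantic import BaseModel, Field, ValidationError

# Same model as above, repeated so this snippet is self-contained
class Customer(BaseModel):
    name: str = Field(..., description="Name of customer")
    policy_number: str = Field(..., description="Policy Number")
    telephone_number: str = Field(..., description="Telephone Number")
    email_address: str = Field(..., description="Email Address")
    address: str = Field(..., description="Customer Address")

# The descriptions end up in the generated JSON schema
schema = Customer.model_json_schema()
for field_name, spec in schema["properties"].items():
    print(f"{field_name}: {spec['description']}")

# A complete JSON document (values taken from the first claim) validates
complete = (
    '{"name": "Eric Cline", "policy_number": "BC-857143475", '
    '"telephone_number": "242.261.8544", '
    '"email_address": "bairddennis@vazquez.com", '
    '"address": "0801 Lauren Alley, East William, MS 43056"}'
)
customer = Customer.model_validate_json(complete)
print(customer.policy_number)

# An incomplete document is rejected, because every field is required
try:
    Customer.model_validate_json('{"name": "Eric Cline"}')
    missing = []
except ValidationError as err:
    missing = [error["loc"][0] for error in err.errors()]
print(missing)
```

Because every field is declared with `...` as its default, all five are required; a field that may legitimately be absent from a claim would need an optional type instead.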

In [11]:
extract_agent = Agent(
    model=Ollama(id="granite3.3", host=OLLAMA_HOST),
    description="You extract information.",
    output_schema=Customer,
)

In [12]:
result = extract_agent.run(cases[0]['description'])

The result content is now an instance of the Customer class that we defined earlier:

In [13]:
type(result.content)

__main__.Customer

In [14]:
from rich.pretty import pprint
pprint(result.content)

We use pretty print to visualize the result. The extraction is perfect: we now have all the desired information in the form we defined. However, there is a caveat. Let's have a look at another insurance claim description.

In [15]:
print(cases[1]['description'])


        Dear Parasol Insurance,

        My name is Aaron Bowman, and I am writing to file a claim for a recent car accident that occurred on 2024-01-02, 
        at approximately 11:45 AM. My policy number is AC-584790380.

        The accident took place at the intersection of Maple Rd and Main St. I was driving my vehicle, a maroon Nissan Elantra with license plate 
        number 285 3YT. At the same time, another vehicle, a silver Honda Traverse with license plate number 1DZ S60, 
        collided with my car. The driver, Douglas Small, failed to adhere to traffic rules, resulting in damage to both vehicles.

        I promptly exchanged information with the other driver and took photos of the accident scene, including damages to both vehicles.
        Attached to this email are the photos, a copy of the police report, and the estimate for the repair costs.

        Kindly assist in processing this claim and let me know the next steps. You can reach me at (909)412-1229 or bestrad

Notice that the address in this case is a DPO address, that is, a Diplomatic Post Office address used for U.S. diplomatic locations abroad. Let's try to extract information from this example and see what happens:

In [16]:
result = extract_agent.run(cases[1]['description'])
pprint(result.content)

Unfortunately, the LLM not only extracted the information but also added its own information about the DPO address.
Let's see if we can fix this by instructing the LLM to only extract information.

In [17]:
extract_agent = Agent(
    model=Ollama(id="granite3.3", host=OLLAMA_HOST),
    description="You only extract information. Don't add any additional details ONLY extract.",
    output_schema=Customer,
)

In [18]:
result = extract_agent.run(cases[1]['description'])
pprint(result.content)

Perfect, just like we wanted. The important lesson here is to always look at the data first and to evaluate the results properly.
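One simple check that could be part of such an evaluation is whether every extracted value appears verbatim in the source text. The sketch below is only one possible approach: `ungrounded_fields` is a hypothetical helper, the source string is a shortened excerpt of the second claim, and the embellished telephone value is invented for illustration rather than taken from actual model output.

```python
# Hypothetical helper, for illustration only: return the names of fields
# whose value does not occur verbatim (ignoring case and whitespace
# differences) in the source text.
def ungrounded_fields(extracted: dict[str, str], source_text: str) -> list[str]:
    normalized_source = " ".join(source_text.split()).lower()
    flagged = []
    for field, value in extracted.items():
        normalized_value = " ".join(value.split()).lower()
        if normalized_value not in normalized_source:
            flagged.append(field)
    return flagged

# Shortened excerpt of the second claim description
source = (
    "My name is Aaron Bowman, and I am writing to file a claim. "
    "My policy number is AC-584790380. You can reach me at (909)412-1229."
)

# Invented example of an extraction where the model added its own detail
extracted = {
    "name": "Aaron Bowman",
    "policy_number": "AC-584790380",
    "telephone_number": "(909)412-1229, a US phone number",
}

flagged = ungrounded_fields(extracted, source)
print(flagged)
```

With the Pydantic model used in this notebook, the dictionary could be obtained with `result.content.model_dump()`. A verbatim check like this is deliberately strict and would also flag harmless reformatting, so it is better suited to surfacing cases for manual review than to automatic rejection.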