# 💡 Fast, Accurate Parsing of Utility Bills with LandingAI

This notebook demonstrates how to use the `landingai-ade` Python package to extract structured information from utility bills using LandingAI's Agentic Document Extraction (ADE) service. It uses electric bills collected from major utility providers across the United States.

We'll walk through:
- Parsing documents with Agentic Document Extraction (ADE).
- Defining a custom schema for utility bills using JSON.
- Viewing structured field extractions and metadata.
- Saving results to CSV.

Not covered:
- Connecting to upstream document sources.
- Inserting `parse()` and `extract()` results into structured tables.
- Optimizing pipeline throughput.



In [None]:
# ---
# Title: Fast, Accurate Parsing of Utility Bills with LandingAI
# Author: Andrea Kropp
# Description: How to apply a custom extraction schema to pull fields out of photos and PDFs of utility bills.
# Target Audience: Developers, Product Managers
# Content Type: How-To
# Change Log:
#   - v1.0: 2025-09-22 Initial version
#   - v2.0: 2026-01-16 Switched to the landingai-ade library
# ---

### Thumbnails for the Electric Bills in the Demo

Notice that there are 3 photos and 6 PDFs. The extraction process shown here works for both without any modifications.

<img src="images/electric-bills-to-parse-PDF-and-image.png" width="80%" alt="Electric bill image preview">

## 📦 Setup & Imports

Import necessary packages and utility functions. Ensure you have installed the required dependencies:

```bash
pip install landingai-ade python-dotenv pandas
```

Obtain your API Key from the Visual Playground at https://va.landing.ai/settings/api-key

Read about options for setting your API key at https://docs.landing.ai/ade/ade-python

This notebook uses a `.env` file in the same directory to store the API key.
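
As a minimal sketch, a `.env` file for this setup would hold a single line of the form `VISION_AGENT_API_KEY=<your key>`, which is the variable the client reads in the cells below. The helper `require_api_key` is hypothetical and not part of `landingai-ade`; it only illustrates failing early with a clear message when the variable is missing, and would be called after `load_dotenv()`.

```python
import os


def require_api_key(var_name: str = "VISION_AGENT_API_KEY") -> str:
    """Return the API key from the environment, or raise if it is missing.

    Hypothetical helper for illustration; not part of landingai-ade.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set. Add it to your .env file or export it in your shell."
        )
    return key
```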

In [1]:
# Standard libraries
import os
import json
from datetime import date
from pathlib import Path
from dotenv import load_dotenv

# Agentic Document Extraction from LandingAI
from landingai_ade import LandingAIADE
from landingai_ade.lib import pydantic_to_json_schema

# Print library version
import landingai_ade
print(f"📦 landingai-ade version: {landingai_ade.__version__}")

📦 landingai-ade version: 1.4.0


In [2]:
# Initialize the ADE client (uses VISION_AGENT_API_KEY environment variable)
# See options at https://docs.landing.ai/ade/agentic-api-key

# Load environment variables from .env file
load_dotenv()

# Initialize the client (it will automatically use VISION_AGENT_API_KEY from environment)
client = LandingAIADE()
print("✅ Authenticated client initialized")

✅ Authenticated client initialized


## 📁 Define Input and Output Directories

Specify where your documents are located and where results will be saved.


In [3]:
# Define input and output directory paths
base_dir = Path(os.getcwd())
input_folder = base_dir / "input_folder"
results_folder = base_dir / "results_folder"

# Create output folders if they don't exist
input_folder.mkdir(parents=True, exist_ok=True)
results_folder.mkdir(parents=True, exist_ok=True)

In [4]:
# Collect all files to be processed
# Check official documentation for all supported filetypes https://docs.landing.ai/ade/ade-file-types

file_paths = [
    p for p in input_folder.iterdir()
    if p.suffix.lower() in [".pdf", ".png", ".jpg", ".jpeg", ".doc", ".docx", ".odt", ".ppt", ".pptx", ".odp"]
]

print(f"📄 Found {len(file_paths)} documents to process")
for i, path in enumerate(file_paths[:10], 1):
    print(f"  {i}. {path.name}")

📄 Found 9 documents to process
  1. electric_C.jpg
  2. electric_B.jpg
  3. electric_A.jpg
  4. electric2.pdf
  5. electric3.pdf
  6. electric1.pdf
  7. electric4.pdf
  8. electric5.pdf
  9. electric6.pdf
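
The listing above comes back in whatever order the filesystem returns it. As a hedged sketch, the hypothetical helper `collect_documents` below applies the same suffix filter but sorts by name and skips directories, so repeated runs process files in a stable order. The suffix set is copied from the cell above and is an assumption about what you want to process, not an authoritative list of supported types.

```python
from pathlib import Path
from typing import Iterable, List

# Suffixes copied from the filter used in this notebook; see the docs for the full list.
SUPPORTED_SUFFIXES = {
    ".pdf", ".png", ".jpg", ".jpeg", ".doc", ".docx", ".odt", ".ppt", ".pptx", ".odp",
}


def collect_documents(folder: Path, suffixes: Iterable[str] = SUPPORTED_SUFFIXES) -> List[Path]:
    """Return supported files in `folder`, sorted by name for a stable order."""
    allowed = {s.lower() for s in suffixes}
    return sorted(
        (p for p in folder.iterdir() if p.is_file() and p.suffix.lower() in allowed),
        key=lambda p: p.name.lower(),
    )
```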


## 📑 Define Custom Schema for Field Extraction

The schema for extracting specific fields from utility bills is defined in JSON and saved in a separate file named `utility_bill.json`.

See https://docs.landing.ai/ade/ade-python#extraction-with-json-schema-file for more details.
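
To make the shape of such a schema concrete, here is a trimmed-down illustration that reuses two field names from this notebook's schema (`provider` and `phone_number` under `provider_info`). The helper `missing_required` is hypothetical; it is a simple sanity check, written under the assumption that every name in a `required` array should also appear under `properties`, and it is not part of `landingai-ade`.

```python
import json
from typing import Any, Dict, List

# Trimmed-down illustration; the real utility_bill.json has more sections and fields.
EXAMPLE_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "title": "Utility Bill Extraction Schema",
    "required": ["provider_info"],
    "properties": {
        "provider_info": {
            "type": "object",
            "title": "Provider Information",
            "required": ["provider", "phone_number"],
            "properties": {
                "provider": {
                    "type": "string",
                    "description": "The name of the utility issuing the bill.",
                },
                "phone_number": {
                    "type": "string",
                    "description": "The customer service phone number.",
                },
            },
        }
    },
}


def missing_required(schema: Dict[str, Any], path: str = "") -> List[str]:
    """List names in any `required` array that lack a matching `properties` entry."""
    problems: List[str] = []
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in properties:
            problems.append(f"{path}{name}")
    # Recurse into nested object definitions.
    for name, sub in properties.items():
        if isinstance(sub, dict) and sub.get("type") == "object":
            problems.extend(missing_required(sub, f"{path}{name}."))
    return problems


# The schema travels as a JSON string, like the contents of a schema file.
schema_text = json.dumps(EXAMPLE_SCHEMA, indent=2)
```

Calling `missing_required(json.loads(schema_text))` on a consistent schema returns an empty list; a non-empty list points at the dotted path of each field that is required but never defined.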

In [5]:
# Load schema from JSON file and print it to view the contents
with open("utility_bill.json", "r") as f:
    schema_utility = f.read()

print(schema_utility)    

{
  "type": "object",
  "title": "Utility Bill Extraction Schema",
  "description": "Schema for extracting key fields from diverse utility bills.",
  "required": [
    "provider_info",
    "account_info",
    "billing_summary",
    "gas_charges",
    "electric_charges"
  ],
  "properties": {
    "provider_info": {
      "type": "object",
      "title": "Provider Information",
      "required": [
        "provider",
        "phone_number",
        "website",
        "usage_bar_chart"
      ],
      "properties": {
        "provider": {
          "type": "string",
          "title": "Utility Name",
          "description": "The name of the utility providing the service and issuing the bill."
        },
        "phone_number": {
          "type": "string",
          "title": "Customer Service Phone Number",
          "description": "The customer service phone number for the utility formatted XXX-XXX-XXXX."
        },
        "website": {
          "type": "string",
          "title": "Web

## 📄 Single Document Example

Let's start with a single document to understand the workflow.

### Two-Step Process: Parse → Extract

**Step 1: Parse**
The `parse()` method converts the document into structured markdown and chunks with grounding information.

**Step 2: Extract**
The `extract()` method applies your custom schema to pull specific fields from the markdown.
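
The two steps can be sketched as one small function. This is an illustration only: `parse_then_extract` is a hypothetical wrapper, and it assumes a client object whose `parse()` and `extract()` accept the keyword arguments used in the cells of this notebook (`document`, `model`, `markdown`, `schema`).

```python
from pathlib import Path
from typing import Any, Tuple


def parse_then_extract(
    client: Any,
    document: Path,
    schema: str,
    model: str = "dpt-2-latest",
) -> Tuple[Any, Any]:
    """Run parse, then extract on its markdown; return (parse_result, extraction_result).

    Hypothetical wrapper: `client` is expected to expose parse() and extract()
    with the keyword arguments used elsewhere in this notebook.
    """
    # Step 1: convert the document into markdown, chunks, and grounding.
    parse_result = client.parse(document=document, model=model)
    # Step 2: apply the schema to the markdown produced by step 1.
    extraction_result = client.extract(markdown=parse_result.markdown, schema=schema)
    return parse_result, extraction_result
```

Keeping the parse result alongside the extraction result matters because the extraction metadata refers back to chunks produced by the parse step.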

### Step 1: Parse a Single Document

In [6]:
from landingai_ade.types import ParseResponse, ExtractResponse

if len(file_paths) > 0:
    # Parse the first document
    single_doc = file_paths[0]
    print(f"🔍 Parsing: {single_doc.name}")

    single_parse_result: ParseResponse = client.parse(
        document=single_doc,
        model="dpt-2-latest"
    )

    # Explore the parse result
    print(f"✅ Parse complete!")

    print(f"Markdown length: {len(single_parse_result.markdown)} characters")
    print(f"Chunks: {len(single_parse_result.chunks)}")
    
    print(f"Parsing metadata: {single_parse_result.metadata}")
    print(f"Grounding details: {single_parse_result.grounding}")

    print(f"\n📝 Markdown preview (first 200 chars):")
    print(single_parse_result.markdown[:200] + "...")

🔍 Parsing: electric_C.jpg
✅ Parse complete!
Markdown length: 4392 characters
Chunks: 25
Parsing metadata: ParseMetadata(credit_usage=3.0, duration_ms=6108, filename='electric_C.jpg', job_id='910f07899e2540f5805fe11f3a5a88d2', org_id='u3z0u1hn4acl', page_count=1, version='dpt-2-20251103', failed_pages=[])
Grounding details: {'d19cd262-30cf-4190-9eff-36550c6a5c3d': GroundingParseResponseGrounding(box=ParseGroundingBox(bottom=0.039192840456962585, left=0.6765761971473694, right=0.9236930012702942, top=0.014950261451303959), page=0, type='chunkMarginalia'), 'e2140f7a-8aca-4160-b2dc-4906a9a4bd07': GroundingParseResponseGrounding(box=ParseGroundingBox(bottom=0.12159416079521179, left=0.08390103280544281, right=0.3306434750556946, top=0.06359539926052094), page=0, type='chunkLogo'), '8fd68cba-9804-432a-8a56-bfce7906c3e5': GroundingParseResponseGrounding(box=ParseGroundingBox(bottom=0.1371183842420578, left=0.47135013341903687, right=0.9220030903816223, top=0.0429876446723938), page=0, ty

### Step 2: Extract Structured Fields

In [7]:
if len(file_paths) > 0:

    # Extract structured data using the schema
    single_extraction_result: ExtractResponse = client.extract(
        markdown=single_parse_result.markdown,  # send the markdown from the parsing step
        schema=schema_utility
    )

    # View the extracted utility bill data

    print(f"✅ Extraction complete!")
    
    print(f"\n📦 Extracted fields:")
    print(single_extraction_result.extraction)

    print(f"\n📦 Extracted field metadata:")
    print(single_extraction_result.extraction_metadata)

    print(f"\n📦 Extraction process details:")
    print(single_extraction_result.metadata)

✅ Extraction complete!

📦 Extracted fields:
{'provider_info': {'provider': 'PSEG', 'phone_number': '1-800-436-7734', 'website': 'pseg.com/myaccount', 'usage_bar_chart': True}, 'account_info': {'account_holder': 'ARISLEIDY BAEZ NUNEZ', 'account_number': '7491381707', 'service_address': '1146 N 31ST ST CAMDEN CITY NJ 08105-4118', 'service_address_primary': '1146 N 31ST ST', 'service_address_city': 'CAMDEN CITY', 'service_address_state': 'NJ', 'service_address_zip': '08105'}, 'billing_summary': {'due_date': '2025-06-26', 'bill_date': '2025-06-11', 'service_start_date': '05-09-2025', 'service_end_date': '06-09-2025', 'total_amount_due': '$1,467.77'}, 'electric_charges': {'meter_number': None, 'usage_kwh': None, 'total_electric_charges': None}, 'gas_charges': {'meter_number': None, 'usage_therms': None, 'total_gas_charges': None}}

üì¶ Extracted field metadata:
{'provider_info': {'provider': {'value': 'PSEG', 'references': ['e2140f7a-8aca-4160-b2dc-4906a9a4bd07']}, 'phone_number': {'v
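
The `references` lists in the field metadata hold chunk ids, and the same ids key the `grounding` mapping returned by the parse step (the id `e2140f7a-8aca-4160-b2dc-4906a9a4bd07` appears in both outputs above). The sketch below shows one way to join the two. It makes two assumptions that you should verify against your own results: that both structures are available as plain dictionaries (for example after `model_dump()`), and that box values are fractions of the page width and height. Both helper names are hypothetical.

```python
from typing import Any, Dict, List


def boxes_for_field(
    field_meta: Dict[str, Any],
    grounding: Dict[str, Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Return page and box for each reference id that has a grounding record."""
    found = []
    for ref in field_meta.get("references", []):
        entry = grounding.get(ref)
        if entry is not None:  # skip ids with no entry in this mapping
            found.append({"chunk_id": ref, "page": entry["page"], "box": entry["box"]})
    return found


def to_pixels(box: Dict[str, float], width: int, height: int) -> Dict[str, int]:
    """Scale a box to pixels, assuming its values are fractions of the page size."""
    return {
        "left": round(box["left"] * width),
        "top": round(box["top"] * height),
        "right": round(box["right"] * width),
        "bottom": round(box["bottom"] * height),
    }
```

Reference ids that are missing from the mapping are skipped rather than raising, since not every id in a `references` list is guaranteed to have a matching key.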

## 🚀 Run ADE Parse + Extract for All Input Files

Parse all documents in the input folder and save outputs:
- **Parse JSON** (`{filename}_parse.json`): Full parse response with markdown, chunks, grounding, and metadata
- **Markdown** (`{filename}.md`): Just the extracted text content
- **Extract JSON** (`{filename}_extract.json`): Structured extraction results with field metadata

Each output file is named after the input file for easy reference.
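
The naming convention above can be sketched as a small helper. `output_paths` is hypothetical and only mirrors the three file names listed; the `keep_suffix` option is an added assumption for folders where two inputs share a stem (such as `bill.pdf` and `bill.jpg`), which would otherwise write to the same output files.

```python
from pathlib import Path
from typing import Dict


def output_paths(results_folder: Path, input_file: Path, keep_suffix: bool = False) -> Dict[str, Path]:
    """Build the three output paths for one input document.

    With keep_suffix=True the original extension becomes part of the name
    (bill.pdf -> bill_pdf), so bill.pdf and bill.jpg do not collide.
    """
    name = input_file.stem
    if keep_suffix and input_file.suffix:
        name = f"{name}_{input_file.suffix.lstrip('.').lower()}"
    return {
        "parse_json": results_folder / f"{name}_parse.json",
        "markdown": results_folder / f"{name}.md",
        "extract_json": results_folder / f"{name}_extract.json",
    }
```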

In [8]:

# Dictionary to store parse and extraction results for each document
results = {}

# Process each document collected earlier in file_paths
for input_file in file_paths:
        
    doc_name = input_file.stem
    print(f"Processing document: {input_file.name}")
    
    # Step 1: Parse the document to extract layout and content
    parse_result: ParseResponse = client.parse(
        document=input_file,
        model="dpt-2-latest"
    )
    print("  ✅ Parsing completed.")
    
    # Save parse results
    parse_json_path = results_folder / f"{doc_name}_parse.json"
    markdown_path = results_folder / f"{doc_name}.md"
    
    with open(parse_json_path, 'w', encoding='utf-8') as f:
        json.dump(parse_result.model_dump(), f, indent=2, ensure_ascii=False, default=str)
    
    with open(markdown_path, 'w', encoding='utf-8') as f:
        f.write(parse_result.markdown)
    
    print(f"  💾 Saved parse JSON and markdown")

    # Step 2: Extract structured fields using the previously loaded schema
    print("  🎯 Running extraction...")
    extraction_result: ExtractResponse = client.extract(
        schema=schema_utility,
        markdown=parse_result.markdown
    )
    print("  ✅ Extraction completed.")
    
    # Save extraction results
    extract_json_path = results_folder / f"{doc_name}_extract.json"
    with open(extract_json_path, 'w', encoding='utf-8') as f:
        json.dump(extraction_result.model_dump(), f, indent=2, ensure_ascii=False, default=str)
    
    print(f"  💾 Saved extraction JSON\n")

    # Store in results dictionary. This will be used later to create a summary dataframe
    results[doc_name] = {
        "parse_result": parse_result,
        "extraction_result": extraction_result
    }

print(f"✅ Processed {len(results)} documents")

Processing document: electric_C.jpg
  ✅ Parsing completed.
  💾 Saved parse JSON and markdown
  🎯 Running extraction...
  ✅ Extraction completed.
  💾 Saved extraction JSON

Processing document: electric_B.jpg
  ✅ Parsing completed.
  💾 Saved parse JSON and markdown
  🎯 Running extraction...
  ✅ Extraction completed.
  💾 Saved extraction JSON

Processing document: electric_A.jpg
  ✅ Parsing completed.
  💾 Saved parse JSON and markdown
  🎯 Running extraction...
  ✅ Extraction completed.
  💾 Saved extraction JSON

Processing document: electric2.pdf
  ✅ Parsing completed.
  💾 Saved parse JSON and markdown
  🎯 Running extraction...
  ✅ Extraction completed.
  💾 Saved extraction JSON

Processing document: electric3.pdf
  ✅ Parsing completed.
  💾 Saved parse JSON and markdown
  🎯 Running extraction...
  ✅ Extraction completed.
  💾 Saved extraction JSON

Processing document: electric1.pdf
  ✅ Parsing completed.
  💾 Saved pars

In [9]:
results

{'electric_C': {'parse_result': ParseResponse(chunks=[Chunk(id='94d75a84-7b0b-4996-ba56-4661863a2fe9', grounding=ChunkGrounding(box=ParseGroundingBox(bottom=0.039192840456962585, left=0.6765761971473694, right=0.9236930012702942, top=0.014950261451303959), page=0), markdown="<a id='94d75a84-7b0b-4996-ba56-4661863a2fe9'></a>\n\nia.pxpsg.j10b0s01.ipsgbill 202506 102152.csv-272055-000009640", type='marginalia'), Chunk(id='d81276b6-2ab2-4870-aefa-0e0d1e92e2a6', grounding=ChunkGrounding(box=ParseGroundingBox(bottom=0.12159416079521179, left=0.08390103280544281, right=0.3306434750556946, top=0.06359539926052094), page=0), markdown="<a id='d81276b6-2ab2-4870-aefa-0e0d1e92e2a6'></a>\n\n<::logo: PSEG\nPSEG\nAn orange sun-like symbol with rays emanating from the center, next to the company name in dark gray text.::>", type='logo'), Chunk(id='f7a7b5b1-744f-4a08-8daf-15b04ba077b0', grounding=ChunkGrounding(box=ParseGroundingBox(bottom=0.1371183842420578, left=0.47135013341903687, right=0.922003090

## 📊 Define Helper Functions

Helper functions to flatten nested dictionaries and create a summary DataFrame from extraction results.

In [15]:
# Helper functions that flatten arbitrarily nested dicts and lists into flat, DataFrame-friendly key/value pairs.

import pandas as pd
from typing import Any, Dict, List, Tuple

def flatten_dict(
    data: Dict[str, Any],
    parent_key: str = "",
    sep: str = "_"
) -> Dict[str, Any]:
    items = {}
    for k, v in data.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k

        if isinstance(v, dict):
            items.update(flatten_dict(v, new_key, sep))
        elif isinstance(v, list):
            items[new_key] = str(v)  # lists → string for DataFrame safety
        else:
            items[new_key] = v

    return items


def flatten_metadata(
    metadata: Dict[str, Any],
    parent_key: str = "",
    sep: str = "_"
) -> Dict[str, Any]:
    """Flatten nested metadata and extract chunk references."""
    items = {}
    
    for k, v in metadata.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        
        if isinstance(v, dict):
            # Check if this is a leaf node with 'value' and 'references'
            if 'references' in v:
                items[f"{new_key}_chunks"] = str(v['references'])
            else:
                # Recurse into nested metadata
                items.update(flatten_metadata(v, new_key, sep))
        elif isinstance(v, list):
            items[f"{new_key}_chunks"] = str(v)
        else:
            items[f"{new_key}_chunks"] = str(v)
    
    return items


def create_summary_dataframe(
    extraction_results: List[Tuple[Any, Any, str]]
) -> pd.DataFrame:
    records = []

    for _, extract_result, doc_name in extraction_results:
        extraction = extract_result.extraction or {}
        metadata = extract_result.extraction_metadata or {}

        # Flatten extraction fields
        flat_extraction = flatten_dict(extraction)
        
        # Flatten metadata to get chunk references
        flat_metadata = flatten_metadata(metadata)

        record = {
            "document_name": doc_name,
            **flat_extraction,
            **flat_metadata
        }

        records.append(record)

    return pd.DataFrame(records)

## 💾 Convert to Table and Save

Convert the field extractions to a pandas DataFrame, then save it to the results folder created earlier.

In [16]:
print("\n📊 Creating summary DataFrame...")

# Convert results dictionary to the format expected by create_summary_dataframe
extraction_results = [
    (result["parse_result"], result["extraction_result"], doc_name)
    for doc_name, result in results.items()
]

df = create_summary_dataframe(extraction_results)

df


📊 Creating summary DataFrame...


Unnamed: 0,document_name,provider_info_provider,provider_info_phone_number,provider_info_website,provider_info_usage_bar_chart,account_info_account_holder,account_info_account_number,account_info_service_address,account_info_service_address_primary,account_info_service_address_city,...,billing_summary_bill_date_chunks,billing_summary_service_start_date_chunks,billing_summary_service_end_date_chunks,billing_summary_total_amount_due_chunks,electric_charges_meter_number_chunks,electric_charges_usage_kwh_chunks,electric_charges_total_electric_charges_chunks,gas_charges_meter_number_chunks,gas_charges_usage_therms_chunks,gas_charges_total_gas_charges_chunks
0,electric_C,PSEG,1-800-436-7734,pseg.com/myaccount,True,ARISLEIDY BAEZ NUNEZ,7491381707,1146 N 31ST ST CAMDEN CITY NJ 08105-4118,1146 N 31ST ST,CAMDEN CITY,...,['38c2c2cb-c2ea-4ac2-a12a-cfb2fe663142'],['38c2c2cb-c2ea-4ac2-a12a-cfb2fe663142'],['38c2c2cb-c2ea-4ac2-a12a-cfb2fe663142'],['32b5da15-95d2-4467-8970-428c62930bd3'],[],[],[],[],[],[]
1,electric_B,Alabama Power,800-245-2244,AlabamaPower.com,True,ERIKA J ZAPATA,96762-33381,703 RALEIGH CT APT A BIRMINGHAM AL 35209,703 RALEIGH CT APT A,BIRMINGHAM,...,[],['fe143fe6-815a-467e-9d1f-3798bffe4e0a'],['fe143fe6-815a-467e-9d1f-3798bffe4e0a'],"['0-5', '4d52e5e5-9fe8-4331-ad3a-0f205ff4d208'...",[],"['f676505e-6d00-41cd-a117-d64123e5039e', '0-D'...",['0ecaac0c-e0be-4afa-8785-5d89d1de14ee'],[],[],[]
2,electric_A,Mid-Carolina Electric Cooperative,803-749-6400,www.mcecoop.com,,CARL P TERRY,7700000024,"134 LAND OF LAKES CIR LEXINGTON, SC 29073-7702",134 LAND OF LAKES CIR,LEXINGTON,...,"['0-s', '9c7053ae-b333-4d76-8190-da65a5454e28']","['0-n', '0-s']","['0-n', '0-s']","['0-6', '0-O']",['0-k'],"['0-o', '0-x']","['0-b', '0-K']",[],[],[]
3,electric2,PSEG,800-436-7734,pseg.com,True,EDITH AVELLA,7002365118,214 KIPP AVE APT C HASBROUCK HEIGHTS NJ 07604-...,214 KIPP AVE APT C,HASBROUCK HEIGHTS,...,['1c277f6e-83a4-4172-ab04-d1dacc1a972f'],['1c277f6e-83a4-4172-ab04-d1dacc1a972f'],['1c277f6e-83a4-4172-ab04-d1dacc1a972f'],['3408b619-83e7-4332-b86e-894c3b174810'],['c659be1f-9fe0-4a53-8927-5a22df303932'],['c659be1f-9fe0-4a53-8927-5a22df303932'],"['ed886210-6bdf-441c-9591-5f3d2f98addd', '3-r']","['238bbaac-96ec-4f42-8f3d-5cb7402eae48', '2-2']","['238bbaac-96ec-4f42-8f3d-5cb7402eae48', '2-g']","['3933b2e9-aaf3-4be4-a340-fac91c62221a', '2-H']"
4,electric3,conEdison,800-752-6633,conEd.com/MyAccount,True,MITCHELL JOHNSON,44-6011-0985-0021-7,435 W 57 STRE 2H,435 W 57 STRE 2H,NEW YORK,...,['0-6'],"['0-i', '1-5']","['0-i', '1-5']","['9ed1ee03-796c-4116-bba4-381da6a256c2', '0-p'...",['1-g'],"['1-o', '88c7401c-4a93-42f3-af42-ca7f24b22b7a']",['1-H'],[],[],[]
5,electric1,"MOUNTAIN VIEW ELECTRIC ASSOCIATION, INC.",1-800-388-9881,www.mvea.coop,True,RON A BAUMERT,61358805,935 FLAMING TREE WAY MONUMENT CO 80132-9306,935 FLAMING TREE WAY,MONUMENT,...,['7ab49125-ed87-4cd7-963d-f4d919b034d1'],['e3293829-c88d-4a35-88cb-38430b472a15'],['e3293829-c88d-4a35-88cb-38430b472a15'],['5b3e2318-ff07-486c-8077-45308b6165eb'],['7ab49125-ed87-4cd7-963d-f4d919b034d1'],['e6c716ea-8f2b-4b69-af45-518ffd649e65'],"['1-d', '1-f']",[],[],[]
6,electric4,Duke Energy,800-700-8744,duke-energy.com,False,KAREN G PEREZ,9100 7883 2561,13619 TORTONA LN APT 3121 WINDERMERE FL 34786,13619 TORTONA LN APT 3121,WINDERMERE,...,['2eb96aa1-51cc-44b6-878a-b8379f534a89'],['2eb96aa1-51cc-44b6-878a-b8379f534a89'],['2eb96aa1-51cc-44b6-878a-b8379f534a89'],['0-b'],['2-2'],['2-8'],['0-7'],[],[],[]
7,electric5,SDG&E,1-877-646-5525,sdge.com,True,DAINETTE R. WOODS,7397 873 592 9,568 S 36TH ST SAN DIEGO CA 92113,568 S 36TH ST,SAN DIEGO,...,['74e2d321-fe73-4fb2-a016-8fea219b37d0'],"['0-q', '0-u']","['0-q', '0-u']","['0-j', '85cbb661-6aae-4359-9b9f-f97d92a10bf9'...",[],['0-v'],['0-w'],[],['0-r'],['0-s']
8,electric6,Mississippi Power,1-800-532-1502,mississippipower.com,True,WILLIAM A VALENCIA,09931-83323,2501 W 7TH ST APT 224 HATTIESBURG MS 39401,2501 W 7TH ST APT 224,HATTIESBURG,...,[],['c3c2b5c5-b595-41dd-b001-d077fe9741f6'],['c3c2b5c5-b595-41dd-b001-d077fe9741f6'],['0-5'],['1-j'],['1-p'],['55fb8094-af35-4a03-8805-6e73ff5d8916'],[],[],[]


In [14]:
# Save the DataFrame to a CSV file inside the results_folder
csv_path = results_folder / "utility_output.csv"
df.to_csv(csv_path, index=False)
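
Because the helper functions store each list of chunk references as a string, the `*_chunks` columns in the saved CSV contain text such as `['0-6', '0-O']` rather than real lists. The sketch below shows one way to read the file back with the standard library and recover the lists using `ast.literal_eval`. The function name `load_summary_csv` is hypothetical, and the approach assumes the cells were produced by `str()` on Python lists, as in this notebook.

```python
import ast
import csv
from pathlib import Path
from typing import Any, Dict, List


def load_summary_csv(csv_path: Path) -> List[Dict[str, Any]]:
    """Read the summary CSV and turn each *_chunks cell back into a Python list."""
    rows: List[Dict[str, Any]] = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            for column, cell in list(row.items()):
                if not column.endswith("_chunks"):
                    continue
                if not cell:
                    row[column] = []  # empty cell: no references recorded
                    continue
                try:
                    row[column] = ast.literal_eval(cell)
                except (ValueError, SyntaxError):
                    pass  # leave cells that are not Python literals untouched
            rows.append(row)
    return rows
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for text read from a file.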

## ✅ Wrap-Up

You’ve now used LandingAI’s ADE to:
- Define custom extraction fields using a JSON schema
- Parse and extract data from images and PDFs
- Export structured results to a table

To learn more, visit the [LandingAI Documentation](https://docs.landing.ai/ade/ade-overview).