# 💡 Fast, Accurate Parsing of Invoices with LandingAI

This notebook demonstrates how to use the `landingai-ade` Python package to extract structured information from invoices using LandingAI's Agentic Document Extraction (ADE) service.

We'll walk through:
- Parsing documents with the ADE Parse API.
- Defining a custom schema for invoices using `pydantic` or JSON.
- Extracting the desired fields with the ADE Extract API.
- Viewing structured field extractions and metadata.

Not covered:
- Connecting to upstream document sources.
- Inserting `parse()` and `extract()` results into database tables.
- Optimizing pipeline throughput.

> 📎 Supported formats: `.pdf`, `.png`, `.jpg`, `.jpeg`. (More coming soon)

In [1]:
# ---
# Title: Fast, Accurate Parsing of Invoices with LandingAI
# Author: Andrea Kropp
# Description: How to apply a custom extraction schema to pull fields out of photos and PDFs of invoices.
# Target Audience: Developers, Product Managers
# Content Type: How-To
# Publish Date: 2025-10-06
# ADE Version: landingai-ade-0.17.1
# Change Log:
#    - v1.0: Initial draft
# ---

### ✨ Install LandingAI's Agentic Document Extraction

```bash
!pip install landingai-ade
```

### 🗝️ Obtain and Set an API Key

Obtain your API key from the Visual Playground at https://va.landing.ai/settings/api-key

Read about options for setting your API key at https://docs.landing.ai/ade/agentic-api-key
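
As a minimal sketch of one way to handle the key, the check below reads it from an environment variable and fails early with a readable message if it is missing. The variable name `VISION_AGENT_API_KEY` matches the one used later in this notebook; the helper name `require_api_key` is hypothetical and not part of the ADE library.

```python
import os


def require_api_key(env=None, name="VISION_AGENT_API_KEY"):
    """Return the API key stored under `name`, or raise if it is missing or blank.

    Hypothetical helper for illustration; it is not part of landingai_ade.
    """
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in your shell."
        )
    return key


# Usage with a stand-in mapping, so the example runs without a real key
demo_env = {"VISION_AGENT_API_KEY": "  demo-key-123  "}
print(require_api_key(demo_env))
```

Calling `require_api_key()` with no arguments checks the real process environment instead of the stand-in mapping.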


## 📦 Setup and Imports

In [2]:
# Standard libraries
import os
import json
from dotenv import load_dotenv
from datetime import date
from pathlib import Path

In [3]:
# LandingAI ADE library
from landingai_ade import LandingAIADE

In [4]:
# Helper functions for using ADE library
from ade_utilities import *

# Helper functions specific to this use case
from invoice_utilities import *

# Import your Pydantic schema for this use case
from invoice_schema import InvoiceExtractionSchema


In [5]:
# Load settings (including the VISION_AGENT_API_KEY) from the .env file
load_dotenv()

True

In [6]:
client = LandingAIADE(apikey=os.environ.get("VISION_AGENT_API_KEY"))
print("Authenticated client initialized")

Authenticated client initialized


In [7]:
import landingai_ade
print(landingai_ade.__version__)

0.17.1


## 📁 Define Input and Output Directories

Specify where your documents are located and where results will be saved.


In [None]:
# Define input and output directory paths
base_dir = Path(os.getcwd())
input_folder = base_dir / "input_folder"
results_folder = base_dir / "results_folder"

# Create output folders if they don't exist
results_folder.mkdir(parents=True, exist_ok=True)

## 🗂️ Collect Document File Paths

This block filters input files for supported formats.

In [9]:
# Collect all document file paths in input folder with supported extensions
# Convert each Path object to a string to ensure compatibility with parse()

file_paths = [
    str(p)
    for p in input_folder.iterdir()
    if p.suffix.lower() in [".pdf", ".png", ".jpg", ".jpeg"]
]
file_paths[0:5]

['/Users/andreakropp/Documents/Github/andrea-kropp/ade_demos/Invoices/input_folder/invoice_12.pdf',
 '/Users/andreakropp/Documents/Github/andrea-kropp/ade_demos/Invoices/input_folder/invoice_13.pdf',
 '/Users/andreakropp/Documents/Github/andrea-kropp/ade_demos/Invoices/input_folder/invoice_9.pdf',
 '/Users/andreakropp/Documents/Github/andrea-kropp/ade_demos/Invoices/input_folder/invoice_11.pdf',
 '/Users/andreakropp/Documents/Github/andrea-kropp/ade_demos/Invoices/input_folder/invoice_10.pdf']

## 📋 Understanding the Invoice Extraction Schema

Before we start parsing, let's understand what data we're extracting. Our `InvoiceExtractionSchema` defines a structured template for invoice data extraction. It's organized into 6 main categories:

### 1. **Invoice Information** (`invoice_info`)
- Invoice date (raw format and standardized YYYY-MM-DD)
- Invoice number
- Order date, PO number
- Payment status (PAID/UNPAID)

### 2. **Customer Information** (`customer_info`)
- Customer name (person or organization)
- Billing address
- Email address

### 3. **Supplier Information** (`company_info`)
- Supplier company name and address
- Contact details (email, phone)
- Sales representative
- Tax identifiers (GSTIN, PAN for India)

### 4. **Order Details** (`order_details`)
- Payment terms (e.g., Net 30)
- Shipping carrier and date
- Tracking number

### 5. **Financial Totals** (`totals_summary`)
- Currency code (ISO format)
- Total amount due (raw text and numeric)
- Subtotal, tax, shipping, handling fees

### 6. **Line Items** (`line_items`)
Each purchased item includes:
- Line number, SKU/part number
- Description
- Quantity, unit price, extended amount
- Item total

> 💡 **Note**: ADE supports **one level of nested schemas**. Our schema has a top-level `InvoiceExtractionSchema` with nested models like `DocumentInfo`, `CustomerInfo`, etc.

The complete schema is defined in `invoice_schema.py` using Pydantic models with field descriptions that guide the AI extraction engine.
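
For readers who prefer the JSON route mentioned at the top of this notebook, here is a minimal sketch of a small schema subset written as a JSON Schema dictionary. The field names and descriptions are illustrative assumptions, not copied from `invoice_schema.py`, and `nesting_depth` is a hypothetical helper that only counts nested object properties (it ignores arrays such as line items).

```python
import json

# Hypothetical subset of an invoice schema, written as JSON Schema.
# Field names and descriptions are illustrative, not copied from invoice_schema.py.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_info": {
            "type": "object",
            "description": "Key identifiers and dates from the invoice header.",
            "properties": {
                "invoice_number": {"type": "string", "description": "Invoice number exactly as printed."},
                "invoice_date": {"type": "string", "description": "Invoice date in YYYY-MM-DD format."},
            },
        },
        "totals_summary": {
            "type": "object",
            "description": "Financial totals.",
            "properties": {
                "currency": {"type": "string", "description": "ISO 4217 currency code, e.g. USD."},
                "total_due": {"type": "number", "description": "Total amount due as a number."},
            },
        },
    },
}


def nesting_depth(schema):
    """Count how many levels of nested objects sit below the top-level object."""
    children = [
        prop for prop in schema.get("properties", {}).values()
        if prop.get("type") == "object"
    ]
    return 0 if not children else 1 + max(nesting_depth(c) for c in children)


schema_json = json.dumps(invoice_schema, indent=2)
print(f"Nesting depth: {nesting_depth(invoice_schema)}")
```

A depth check like this is one way to confirm a schema stays within the single level of nesting noted above before sending it for extraction.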

## 📄 Single Invoice Parsing

Let's start with a single document to understand the ADE workflow.

### Two-Step Process: Parse → Extract

ADE uses a **two-step pipeline**:

#### Step 1: **Parse** 
The `parse()` method converts the document into structured markdown and chunks:
- **`markdown`**: Full document content in markdown format
- **`chunks`**: Individual text/image regions (paragraphs, tables, logos, etc.)
- **`grounding`**: Bounding box coordinates showing where each chunk appears in the original document
- **`metadata`**: Document info (filename, version, processing time, page count)
- **`splits`**: Page splits for multi-page documents

> 💡 **Parsing** is like OCR + layout understanding. It gives you the raw extracted content but not yet structured into your schema.

#### Step 2: **Extract**
The `extract()` method applies your custom schema to pull specific fields:
- Takes the markdown from Step 1
- Uses your Pydantic schema (e.g., `InvoiceExtractionSchema`)
- Returns structured data matching your schema fields
- Includes extraction metadata (confidence scores, source chunks)

> 💡 **Extraction** is like applying a template to the parsed content, pulling out only the fields you care about.

### Why Two Steps?

- **Flexibility**: Parse once, extract multiple times with different schemas
- **Debugging**: Inspect the raw parsed content to troubleshoot extraction issues
- **Efficiency**: Reuse parse results for multiple extraction needs
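
To make the "parse once, extract many times" idea concrete, here is a minimal sketch that runs against a stand-in client rather than the real API. The `parse(document=...)` call mirrors the usage later in this notebook; the `extract(schema=..., markdown=...)` keyword names are an assumption about the client interface and should be checked against the ADE documentation. `FakeClient` and `extract_with_schemas` are hypothetical names.

```python
from types import SimpleNamespace


class FakeClient:
    """Stand-in for the ADE client so the sketch runs offline. Not the real API."""

    def __init__(self):
        self.parse_calls = 0
        self.extract_calls = 0

    def parse(self, document):
        self.parse_calls += 1
        return SimpleNamespace(markdown=f"# Parsed content of {document}")

    def extract(self, schema, markdown):
        self.extract_calls += 1
        return SimpleNamespace(extraction={"schema": schema, "chars": len(markdown)})


def extract_with_schemas(client, document, schemas):
    """Parse the document once, then run one extraction per schema on the same markdown."""
    parsed = client.parse(document=document)
    return {
        name: client.extract(schema=schema, markdown=parsed.markdown)
        for name, schema in schemas.items()
    }


fake_client = FakeClient()
results = extract_with_schemas(
    fake_client,
    "invoice_12.pdf",
    {"header": '{"type": "object"}', "totals": '{"type": "object"}'},
)
print(fake_client.parse_calls, fake_client.extract_calls)
```

The counters show the point of the design: one parse call feeds two extractions, so the parsing work is not repeated for each schema.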

Let's see both steps in action!

In [None]:
# Example 1: Parse only (no saving or extraction yet)
# This converts the PDF/image into structured markdown and chunks

client = LandingAIADE()
single_result = client.parse(document=Path(file_paths[0]))

# The parse result contains:
# - markdown: full document text
# - chunks: individual regions with bounding boxes
# - metadata: document version, processing info
print(f"Number of chunks: {len(single_result.chunks)}")
print("Global markdown:", single_result.markdown[:200] + "...")

Number of chunks: 46
Global markdown: <a id='810c399c-9902-4970-a2c1-8a43d71a853a'></a>

<::logo: Condor
condor
The logo features the word "condor" in bold, sans-serif, lowercase letters, followed by a circular emblem containing a stylize...


In [None]:
# Explore the parse result structure
# Uncomment any line below to inspect different parts:

# single_result.markdown           # Full document as markdown text
# single_result.chunks              # List of text/image regions with bounding boxes
# single_result.metadata            # Document info: filename, version, pages, processing time
# single_result.splits              # Page boundaries for multi-page documents
# single_result.grounding           # Bounding box coordinates for each chunk

In [11]:
# Example 2: Parse and save results to JSON
# This is the same as Example 1, but saves the output to a file
# Saved file: results_folder/parse_invoice_13.json

single_result_parse_save = parse_and_save(document_path=file_paths[1], client=client, output_dir=results_folder)
single_result_parse_save

Parse results saved to: /Users/andreakropp/Documents/Github/andrea-kropp/ade_demos/Invoices/results_folder/parse_invoice_13.json


ParseResponse(chunks=[Chunk(id='d38cc036-b449-457f-ba7a-ed5ec8db2df8', grounding=ChunkGrounding(box=ChunkGroundingBox(bottom=0.04086274281144142, left=0.3944183588027954, right=0.505725085735321, top=0.021210074424743652), page=0), markdown="<a id='d38cc036-b449-457f-ba7a-ed5ec8db2df8'></a>\n\nTax Invoice", type='marginalia'), Chunk(id='95b3f28f-7aa0-4ba9-adfd-92552a14e65f', grounding=ChunkGrounding(box=ChunkGroundingBox(bottom=0.03962094336748123, left=0.6299343705177307, right=0.8244345784187317, top=0.017798636108636856), page=0), markdown="<a id='95b3f28f-7aa0-4ba9-adfd-92552a14e65f'></a>\n\n(ORIGINAL FOR RECIPIENT)", type='text'), Chunk(id='634f7e5f-23be-428e-8015-4137f52b2c03', grounding=ChunkGrounding(box=ChunkGroundingBox(bottom=0.402715265750885, left=0.06802284717559814, right=0.49336951971054077, top=0.04924289882183075), page=0), markdown="<a id='634f7e5f-23be-428e-8015-4137f52b2c03'></a>\n\nKANDHAN METAL COMPANY\nOLD NO: 12, NEW NO :33,JANI BATCHA STREET.\nROYAPETTAH.CHENN

In [12]:
# Example 3: Complete pipeline - Parse AND Extract
# This does BOTH steps:
#   1. Parse the document (convert to markdown + chunks)
#   2. Extract structured data using InvoiceExtractionSchema
# Saves two files: 
#   - results_folder/parse_invoice_9.json (raw parse output)
#   - results_folder/extract_invoice_9.json (structured invoice fields)

single_result_full_pipe = parse_extract_save(file_paths[2], client, InvoiceExtractionSchema, output_dir=results_folder)
single_result_full_pipe

Parse results saved to: /Users/andreakropp/Documents/Github/andrea-kropp/ade_demos/Invoices/results_folder/parse_invoice_9.json
Extract results saved to: /Users/andreakropp/Documents/Github/andrea-kropp/ade_demos/Invoices/results_folder/extract_invoice_9.json


(ParseResponse(chunks=[Chunk(id='a4f89b4f-3569-49bc-ac94-6bf78ee607e6', grounding=ChunkGrounding(box=ChunkGroundingBox(bottom=0.096154123544693, left=0.08743181824684143, right=0.44194576144218445, top=0.033484891057014465), page=0), markdown="<a id='a4f89b4f-3569-49bc-ac94-6bf78ee607e6'></a>\n\n<::logo: Freshworks\nfreshworks\nA stylized leaf-like icon composed of multiple facets in shades of gray is positioned to the left of the text.::>", type='logo'), Chunk(id='f47a315b-a6f5-46b9-a458-5c1fd01e937f', grounding=ChunkGrounding(box=ChunkGroundingBox(bottom=0.21089597046375275, left=0.08651575446128845, right=0.4332708418369293, top=0.10671643912792206), page=0), markdown="<a id='f47a315b-a6f5-46b9-a458-5c1fd01e937f'></a>\n\nFreshworks Inc., (formerly known as Freshdesk Inc.)\n2950 S. Delaware St,\nSuite 201, San Mateo, CA 94403,\nU.S.A.\nPhone: +1 (866) 832 3090\nTax ID: 33-1218825\nTax Reg #: 33-1218825", type='text'), Chunk(id='41b10bea-d855-4c29-91d4-fd6bf2b21353', grounding=ChunkGr

## 🧩 Parallel ADE Parsing and Extraction with Progress Tracking

This section runs the full **parse and extract pipeline in parallel** using the LandingAI Agentic Document Extraction (ADE) client.  
It scans the input directory for all `.pdf`, `.png`, `.jpg`, and `.jpeg` files, sends each file to the ADE API,  
and saves the parse and extract results to the specified output folder.

Key features:
- ⚡ **Parallel processing** with `ThreadPoolExecutor` to speed up large batches  
- 📊 **Real-time progress bar** using `tqdm` to visualize progress  
- 💾 **Automatic result saving** via `parse_extract_save()`  
- 🧱 **Robust handling**: failed files are reported and skipped without stopping the batch  
- 🧠 **Results aggregation**: each successful `(parse_result, extract_result)` tuple is stored in `results_summary`

After execution, you'll see:
- A live progress bar showing completion
- A status message for each saved result and each failed document
- A summary of how many documents were successfully processed

In [13]:
from pathlib import Path
from landingai_ade import LandingAIADE
from concurrent.futures import ThreadPoolExecutor, as_completed
import time
from tqdm import tqdm
from ade_utilities import parse_extract_save
from invoice_schema import InvoiceExtractionSchema

# --- CONFIG ---
input_dir = Path("input_folder2")
output_dir = Path("results_folder")
output_dir.mkdir(parents=True, exist_ok=True)

max_workers = 10  # adjust for your system and ADE rate limits
pause_between_requests = 0.2  # small delay to avoid hitting rate limits

# --- CLIENT ---
client = LandingAIADE()

# --- FILE LIST ---
file_paths = [p for p in input_dir.glob("*.*") if p.suffix.lower() in (".pdf", ".png", ".jpg", ".jpeg")]
print(f"Found {len(file_paths)} documents to parse and extract.")

# --- WORKER FUNCTION ---
def process_file(path: Path):
    try:
        # Parse AND extract using the utility function
        parse_result, extract_result = parse_extract_save(
            path, 
            client, 
            InvoiceExtractionSchema, 
            output_dir=output_dir
        )
        time.sleep(pause_between_requests)
        return (parse_result, extract_result)  # 👈 return both results as tuple
    except Exception as e:
        print(f"❌ {path.name} failed: {e}")
        return None

# --- PARALLEL EXECUTION WITH PROGRESS BAR ---
results_summary = []
with ThreadPoolExecutor(max_workers=max_workers) as executor:
    futures = [executor.submit(process_file, p) for p in file_paths]
    # tqdm progress bar updates as futures complete
    for future in tqdm(as_completed(futures), total=len(futures), desc="Processing documents"):
        result = future.result()
        if result is not None:
            results_summary.append(result)

# --- SUMMARY ---
success_count = len(results_summary)  # failed files returned None and were never appended
print(f"\n✅ Completed {success_count}/{len(file_paths)} documents successfully.")
print("📊 Each document has been parsed AND extracted with structured data.")

Found 3 documents to parse and extract.


Processing documents:   0%|          | 0/3 [00:00<?, ?it/s]

Parse results saved to: results_folder/parse_invoice_2.json
Parse results saved to: results_folder/parse_invoice_1.json
Parse results saved to: results_folder/parse_invoice_3.json
Extract results saved to: results_folder/extract_invoice_2.json


Processing documents:  33%|███▎      | 1/3 [00:14<00:29, 15.00s/it]

Extract results saved to: results_folder/extract_invoice_1.json


Processing documents:  67%|██████▋   | 2/3 [00:18<00:08,  8.25s/it]

Extract results saved to: results_folder/extract_invoice_3.json


Processing documents: 100%|██████████| 3/3 [00:20<00:00,  6.87s/it]


✅ Completed 3/3 documents successfully.
📊 Each document has been parsed AND extracted with structured data.





In [14]:
results_summary[0:5]

[(ParseResponse(chunks=[Chunk(id='5e627a04-6af2-4ca1-b7d7-24a9edea248d', grounding=ChunkGrounding(box=ChunkGroundingBox(bottom=0.2219770848751068, left=0.07200773060321808, right=0.4208279848098755, top=0.05207393318414688), page=0), markdown='<a id=\'5e627a04-6af2-4ca1-b7d7-24a9edea248d\'></a>\n\n<::logo: RR Roofing Renovations\nRR\nROOFING RENOVATIONS\nThe logo features a stylized house outline with large gray "RR" letters and smaller "ROOFING RENOVATIONS" text below.::>', type='logo'), Chunk(id='9e6635a3-de58-4fe2-956e-787713d5bca7', grounding=ChunkGrounding(box=ChunkGroundingBox(bottom=0.08431334048509598, left=0.4983729124069214, right=0.7101038694381714, top=0.04254310578107834), page=0), markdown="<a id='9e6635a3-de58-4fe2-956e-787713d5bca7'></a>\n\nRoofing Renovations INC\n336-862-5235", type='text'), Chunk(id='3d4edaa6-d809-4408-aa77-edc4baee0bc0', grounding=ChunkGrounding(box=ChunkGroundingBox(bottom=0.11778980493545532, left=0.7225770354270935, right=0.948281466960907, top=0

## 📊 Organizing Results into Summary Tables

After parsing and extracting all documents, we'll organize the data into **4 normalized DataFrames** that match a typical database schema:

### 1. **Markdown Table** - Full Document Text
- One row per invoice
- Contains the complete markdown output from parsing
- Useful for full-text search and debugging

### 2. **Chunks Table** - Document Regions with Coordinates
- One row per chunk (text region, table, logo, etc.)
- Includes bounding box coordinates (left, top, right, bottom)
- Shows page number where chunk appears
- Links back to invoice via `INVOICE_UUID`
- **Use case**: Visualizing where data was found, audit trails

### 3. **Invoice Main Table** - One Row Per Invoice
- Flattened structure with all top-level invoice fields
- Contains invoice info, customer, supplier, order details, totals
- Primary table for reporting and analytics
- Linked by `INVOICE_UUID` to other tables

### 4. **Line Items Table** - Multiple Rows Per Invoice
- One row for each product/service line item
- Contains SKU, description, quantity, price, amount
- Linked to parent invoice via `INVOICE_UUID`
- **Use case**: Product-level analytics, inventory tracking

The tables share common keys:
- `RUN_ID`: Identifies this batch processing run
- `INVOICE_UUID`: Unique identifier for each invoice
- `DOCUMENT_NAME`: Original filename
- `AGENTIC_DOC_VERSION`: ADE engine version used (not included in the chunks table output shown below)

This normalized structure makes it easy to insert into databases like Snowflake, PostgreSQL, or BigQuery.
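
The sketch below illustrates the kind of flattening that turns one nested extraction into a main row plus line-item rows linked by `INVOICE_UUID`. It is not the implementation in `invoice_utilities.py`; the function name `flatten_invoice`, the column naming, and the sample field values are assumptions for illustration.

```python
import uuid


def flatten_invoice(extraction, document_name, run_id):
    """Split one nested extraction dict into a main row and a list of line-item rows."""
    invoice_uuid = str(uuid.uuid4())
    keys = {"RUN_ID": run_id, "INVOICE_UUID": invoice_uuid, "DOCUMENT_NAME": document_name}

    main_row = dict(keys)
    for fields in extraction.values():
        if isinstance(fields, dict):
            # Flatten one level of nesting: each nested field becomes a column
            for field, value in fields.items():
                main_row[field.upper()] = value

    # Each line item becomes its own row, carrying the same shared keys
    item_rows = [
        {**keys, "LINE_INDEX": i, **{k.upper(): v for k, v in item.items()}}
        for i, item in enumerate(extraction.get("line_items", []))
    ]
    return main_row, item_rows


# Sample values are made up for the example
sample = {
    "invoice_info": {"invoice_number": "A-100", "invoice_date": "2024-01-15"},
    "totals_summary": {"currency": "USD", "total_due": 175.0},
    "line_items": [
        {"description": "Widget", "quantity": 2, "amount": 150.0},
        {"description": "Delivery", "quantity": 1, "amount": 25.0},
    ],
}
main_row, item_rows = flatten_invoice(sample, "invoice_demo.pdf", run_id="run-001")
print(main_row["INVOICE_NUMBER"], len(item_rows))
```

Because every row carries the same `INVOICE_UUID`, the line items can later be joined back to their parent invoice in a DataFrame or a database.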

In [15]:
# Convert the batch results into 4 normalized DataFrames
# Function located in invoice_utilities.py
# Takes the list of (parse_result, extract_result) tuples and creates structured tables

invoice_summaries = create_invoice_summary_tables(results_summary)

## 📝 Table 1: Markdown - Full Document Text per Invoice

In [16]:
# Display the markdown table
# One row per invoice with full document text
invoice_markdown = invoice_summaries[0]
invoice_markdown

Unnamed: 0,RUN_ID,INVOICE_UUID,DOCUMENT_NAME,AGENTIC_DOC_VERSION,MARKDOWN
0,0499dbd8-4971-4903-ad2f-cb6a0af4285a,82d35867-787d-40cc-8d05-b6e86239ce06,invoice_2.pdf,dpt-2-20250919,<a id='5e627a04-6af2-4ca1-b7d7-24a9edea248d'><...
1,0499dbd8-4971-4903-ad2f-cb6a0af4285a,42c1cd0e-907e-4258-af34-a93519a33cdc,invoice_1.pdf,dpt-2-20250919,<a id='163da434-a2a8-4ac9-8cd0-559a3d6b3a78'><...
2,0499dbd8-4971-4903-ad2f-cb6a0af4285a,ddd6244c-9be3-411a-9f56-bf4a0325aaa7,invoice_3.pdf,dpt-2-20250919,<a id='00c1108c-2a02-4ebd-ae63-b366b6afc7cd'><...


## 🧩 Table 2: Chunks - Document Regions with Bounding Boxes

Each chunk represents a distinct region in the document (paragraph, table, logo, etc.) with its location coordinates. This is useful for:
- **Visual grounding**: See exactly where each piece of text was found
- **Audit trails**: Verify extraction accuracy by reviewing source chunks
- **Layout analysis**: Understand document structure

In [17]:
# Display the chunks table
# One row per chunk with bounding box coordinates (box_l, box_t, box_r, box_b)
# Coordinates are normalized 0-1 relative to page dimensions
invoice_chunks = invoice_summaries[1]
invoice_chunks

Unnamed: 0,RUN_ID,INVOICE_UUID,DOCUMENT_NAME,chunk_id,chunk_type,text,page,box_l,box_t,box_r,box_b
0,0499dbd8-4971-4903-ad2f-cb6a0af4285a,82d35867-787d-40cc-8d05-b6e86239ce06,invoice_2.pdf,5e627a04-6af2-4ca1-b7d7-24a9edea248d,logo,<a id='5e627a04-6af2-4ca1-b7d7-24a9edea248d'><...,0,0.072008,0.052074,0.420828,0.221977
1,0499dbd8-4971-4903-ad2f-cb6a0af4285a,82d35867-787d-40cc-8d05-b6e86239ce06,invoice_2.pdf,9e6635a3-de58-4fe2-956e-787713d5bca7,text,<a id='9e6635a3-de58-4fe2-956e-787713d5bca7'><...,0,0.498373,0.042543,0.710104,0.084313
2,0499dbd8-4971-4903-ad2f-cb6a0af4285a,82d35867-787d-40cc-8d05-b6e86239ce06,invoice_2.pdf,3d4edaa6-d809-4408-aa77-edc4baee0bc0,text,<a id='3d4edaa6-d809-4408-aa77-edc4baee0bc0'><...,0,0.722577,0.044637,0.948281,0.11779
3,0499dbd8-4971-4903-ad2f-cb6a0af4285a,82d35867-787d-40cc-8d05-b6e86239ce06,invoice_2.pdf,4218c1aa-560b-476b-8f2e-ec9508896069,text,<a id='4218c1aa-560b-476b-8f2e-ec9508896069'><...,0,0.053898,0.279113,0.276529,0.369774
4,0499dbd8-4971-4903-ad2f-cb6a0af4285a,82d35867-787d-40cc-8d05-b6e86239ce06,invoice_2.pdf,65fbe8ff-f266-49cf-b951-4abb97fa3ac2,text,<a id='65fbe8ff-f266-49cf-b951-4abb97fa3ac2'><...,0,0.31915,0.279771,0.435022,0.374348
5,0499dbd8-4971-4903-ad2f-cb6a0af4285a,82d35867-787d-40cc-8d05-b6e86239ce06,invoice_2.pdf,72cdd2d9-6f2a-4450-a82b-8ec4009a1eef,text,<a id='72cdd2d9-6f2a-4450-a82b-8ec4009a1eef'><...,0,0.498381,0.277921,0.66792,0.377318
6,0499dbd8-4971-4903-ad2f-cb6a0af4285a,82d35867-787d-40cc-8d05-b6e86239ce06,invoice_2.pdf,2b891183-123d-4ee0-b61a-a6631ee7f205,text,<a id='2b891183-123d-4ee0-b61a-a6631ee7f205'><...,0,0.73391,0.278069,0.947393,0.338912
7,0499dbd8-4971-4903-ad2f-cb6a0af4285a,82d35867-787d-40cc-8d05-b6e86239ce06,invoice_2.pdf,3bf54dcf-f6a8-478f-94b7-342f9689568a,table,<a id='3bf54dcf-f6a8-478f-94b7-342f9689568a'><...,0,0.053954,0.436618,0.948443,0.681845
8,0499dbd8-4971-4903-ad2f-cb6a0af4285a,82d35867-787d-40cc-8d05-b6e86239ce06,invoice_2.pdf,2702091f-6a0d-43f3-ae65-eeaf5e5da24d,text,<a id='2702091f-6a0d-43f3-ae65-eeaf5e5da24d'><...,0,0.053102,0.720391,0.149619,0.766105
9,0499dbd8-4971-4903-ad2f-cb6a0af4285a,82d35867-787d-40cc-8d05-b6e86239ce06,invoice_2.pdf,a331fa69-2369-42c9-b672-5b4d30e70c47,text,<a id='a331fa69-2369-42c9-b672-5b4d30e70c47'><...,0,0.05319,0.802936,0.55762,0.853752
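
Because the box coordinates are normalized to the 0-1 range, drawing a chunk on a rendered page requires scaling by the page size in pixels. The sketch below shows that conversion using the first row of the table above; the function name `box_to_pixels` and the 1700 x 2200 pixel page size are assumptions for illustration.

```python
def box_to_pixels(box_l, box_t, box_r, box_b, page_width, page_height):
    """Scale a normalized (0-1) bounding box to integer pixel coordinates."""
    return (
        round(box_l * page_width),
        round(box_t * page_height),
        round(box_r * page_width),
        round(box_b * page_height),
    )


# First chunk (the logo) from the table above, on a hypothetical 1700 x 2200 px page render
left, top, right, bottom = box_to_pixels(0.072008, 0.052074, 0.420828, 0.221977, 1700, 2200)
print(left, top, right, bottom)
```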


## 📄 Table 3: Invoice Main - Flattened Invoice-Level Fields

This is the primary table for analytics and reporting. Each row contains all extracted invoice fields flattened into a single record.

In [18]:
# Display the main invoice table
# One row per invoice with all extracted fields (31 columns total)
# Includes: invoice info, customer, supplier, order details, financial totals
invoice_main = invoice_summaries[2]
invoice_main.head(10)

Unnamed: 0,RUN_ID,INVOICE_UUID,DOCUMENT_NAME,AGENTIC_DOC_VERSION,INVOICE_DATE_RAW,INVOICE_DATE,INVOICE_NUMBER,ORDER_DATE,PO_NUMBER,STATUS,...,SHIP_VIA,SHIP_DATE,TRACKING_NUMBER,CURRENCY,TOTAL_DUE_RAW,TOTAL_DUE,SUBTOTAL,TAX,SHIPPING,HANDLING_FEE
0,0499dbd8-4971-4903-ad2f-cb6a0af4285a,82d35867-787d-40cc-8d05-b6e86239ce06,invoice_2.pdf,dpt-2-20250919,01/29/2020,2020-01-29,0001131,,,,...,,,,USD,"$2,650.00",2650.0,2650.0,0,,
1,0499dbd8-4971-4903-ad2f-cb6a0af4285a,42c1cd0e-907e-4258-af34-a93519a33cdc,invoice_1.pdf,dpt-2-20250919,07/29/2020,2020-07-29,INV33543191,,,,...,,,,USD,$149.90,149.9,149.9,0,,
2,0499dbd8-4971-4903-ad2f-cb6a0af4285a,ddd6244c-9be3-411a-9f56-bf4a0325aaa7,invoice_3.pdf,dpt-2-20250919,10/03/2022,2022-10-03,52255,,,,...,,,,,"$1,550.00",1550.0,1550.0,0,,
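
Since the table keeps both raw and normalized versions of the date and total, a quick consistency check can flag rows where the two disagree. This is a sketch only: it assumes US-style `MM/DD/YYYY` raw dates and `$`-prefixed amounts, as in the three sample rows, and the helper names `date_matches` and `amount_matches` are hypothetical.

```python
from datetime import datetime
from decimal import Decimal


def date_matches(raw, normalized, raw_format="%m/%d/%Y"):
    """True if the raw date string parses to the same day as the YYYY-MM-DD value."""
    return datetime.strptime(raw, raw_format).date().isoformat() == normalized


def amount_matches(raw, numeric):
    """True if the raw amount text, stripped of '$' and ',', equals the numeric value."""
    cleaned = raw.replace("$", "").replace(",", "").strip()
    return Decimal(cleaned) == Decimal(str(numeric))


# Values copied from the three rows shown above
rows = [
    ("01/29/2020", "2020-01-29", "$2,650.00", 2650.0),
    ("07/29/2020", "2020-07-29", "$149.90", 149.9),
    ("10/03/2022", "2022-10-03", "$1,550.00", 1550.0),
]
for raw_date, iso_date, raw_total, total in rows:
    print(date_matches(raw_date, iso_date), amount_matches(raw_total, total))
```

Invoices from other regions would need a different `raw_format` and currency handling, so treat a mismatch as a prompt for review rather than proof of an extraction error.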


## 🛒 Table 4: Line Items - Product/Service Details

This table contains the individual line items from each invoice. An invoice with multiple products or services has one row per item.

In [19]:
# Display the line items table
# Multiple rows per invoice, one for each purchased product/service
# Contains: line number, SKU, description, quantity, pricing details
invoice_items = invoice_summaries[3]
invoice_items.head(10)

Unnamed: 0,RUN_ID,INVOICE_UUID,DOCUMENT_NAME,AGENTIC_DOC_VERSION,LINE_INDEX,LINE_NUMBER,SKU,DESCRIPTION,QUANTITY,UNIT_PRICE,PRICE,AMOUNT,TOTAL
0,0499dbd8-4971-4903-ad2f-cb6a0af4285a,82d35867-787d-40cc-8d05-b6e86239ce06,invoice_2.pdf,dpt-2-20250919,0,,,Replace HVAC Unit.,1,2650.0,,2650.0,2650.0
1,0499dbd8-4971-4903-ad2f-cb6a0af4285a,42c1cd0e-907e-4258-af34-a93519a33cdc,invoice_1.pdf,dpt-2-20250919,0,,,Charge Name: Standard Pro Annual Quantity: 1 U...,1,149.9,,149.9,149.9
2,0499dbd8-4971-4903-ad2f-cb6a0af4285a,ddd6244c-9be3-411a-9f56-bf4a0325aaa7,invoice_3.pdf,dpt-2-20250919,0,,,Provide and install new 60 Amp 240Vac breaker ...,1,1400.0,,1400.0,
3,0499dbd8-4971-4903-ad2f-cb6a0af4285a,ddd6244c-9be3-411a-9f56-bf4a0325aaa7,invoice_3.pdf,dpt-2-20250919,1,,,Permit and inspection fees,1,150.0,,150.0,
4,0499dbd8-4971-4903-ad2f-cb6a0af4285a,ddd6244c-9be3-411a-9f56-bf4a0325aaa7,invoice_3.pdf,dpt-2-20250919,2,,,"Payment Terms: by Cash or Check, All sales fin...",1,0.0,,0.0,
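
One simple sanity check on extracted line items is to confirm that the amounts add up to the invoice subtotal. The sketch below does this for `invoice_3.pdf` using the values shown in the tables above; `line_items_match_subtotal` is a hypothetical helper and the one-cent tolerance is an arbitrary choice.

```python
from decimal import Decimal


def line_items_match_subtotal(amounts, subtotal, tolerance="0.01"):
    """True if the line-item amounts sum to the subtotal within the given tolerance."""
    total = sum((Decimal(str(a)) for a in amounts), Decimal("0"))
    return abs(total - Decimal(str(subtotal))) <= Decimal(tolerance)


# AMOUNT values for invoice_3.pdf from the line items table, SUBTOTAL from the main table
print(line_items_match_subtotal([1400.0, 150.0, 0.0], 1550.0))
```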


## 💾 Save Structured Results

Save the four summary tables to CSV files. They could also be inserted into a database such as Snowflake or PostgreSQL, or used for downstream analytics.

In [None]:
# Save all four DataFrames to CSV files in the results_folder
# Files created:
#   - invoice_markdown.csv
#   - invoice_chunks.csv
#   - invoice_main.csv
#   - invoice_items.csv

invoice_markdown.to_csv(results_folder / "invoice_markdown.csv", index=False)
invoice_chunks.to_csv(results_folder / "invoice_chunks.csv", index=False)
invoice_main.to_csv(results_folder / "invoice_main.csv", index=False)
invoice_items.to_csv(results_folder / "invoice_items.csv", index=False)

print(f"✅ Saved 4 CSV files to {results_folder}")

✅ Saved 4 CSV files to /Users/andreakropp/Documents/Github/andrea-kropp/ade_demos/Invoices/results_folder


## ✅ Wrap-Up

You’ve now used LandingAI’s ADE to:
- Parse and extract data from invoices, whether the originals are images or PDFs.
- Define custom fields using a `pydantic` schema.
- Run Agentic Document Extraction on a batch of documents and save the results.
- Organize the extracted results into structured tables and save them as CSV files.

To learn more, visit the [LandingAI Documentation](https://docs.landing.ai/ade/ade-overview).