# Lab 4: Document Understanding with Agentic Document Extraction

In this lab, you will use LandingAI's Agentic Document Extraction (ADE) framework to parse documents and extract key-value pairs using a single API. Lab 4 has two parts. This first part covers Exercise 1 (Extract Key-Value Pairs from a Utility Bill) and Exercise 2 (ADE on Difficult Documents). The second part covers Exercise 3 (Automated Pipeline for Loan Applications).

**Learning Objectives:**
- Use the Parse API to convert documents into structured markdown with visual grounding
- Define JSON schemas to extract specific fields from documents
- Use the Extract API to pull key-value pairs with source location references

## Background

ADE is built on three approaches:
- Vision-First: Documents are visual objects where meaning is encoded in layout, structure, and spatial relationships
- Data-Centric: Models are trained on large, diverse, and curated datasets
- Agentic: The system plans, decides, acts, and verifies until responses meet quality thresholds

The foundation is the **Document Pre-trained Transformer (DPT)** family of models that combine text parsing, layout detection, reading order, and multimodal reasoning capabilities.

## Outline

- [1. Setup and Authentication](#1)
- [2. Helper Functions](#2)
- [3. Exercise 1: Extract Key-Value Pairs from a Utility Bill](#3)
  - [3.1 Preview the Document](#3-1)
  - [3.2 Parse the Document](#3-2)
  - [3.3 Explore the Output from Parse](#3-3)
  - [3.4 Extract Key-Value Pairs](#3-4)
  - [3.5 Explore the Output from Extract](#3-5)
- [4. Exercise 2: ADE on Difficult Documents](#4)
  - [4.1 Charts and Flowcharts](#4-1)
  - [4.2 Tables with Missing Gridlines](#4-2)
  - [4.3 Handwritten Forms](#4-3)
  - [4.4 Handwritten Calculus](#4-4)
  - [4.5 Illustrations and Infographics](#4-5)
  - [4.6 Stamps and Signatures](#4-6)
- [5. Summary](#5)

<a id="1"></a>

## 1. Setup and Authentication

Import the required libraries. Key imports from LandingAI:
- `LandingAIADE`: Client for making API calls
- `ParseResponse`: Type for document parsing results
- `ExtractResponse`: Type for field extraction results

In [None]:
# General imports

import os
import json
import pymupdf
import pandas as pd
from pathlib import Path
from dotenv import load_dotenv
from IPython.display import display, IFrame, Markdown, HTML
from IPython.display import Image as DisplayImage
from PIL import Image as PILImage, ImageDraw

In [None]:
# Imports specific to Agentic Document Extraction

from landingai_ade import LandingAIADE
from landingai_ade.types import ParseResponse, ExtractResponse

In [None]:
# Load environment variables from .env
_ = load_dotenv(override=True)

Initialize the ADE client. The API key is loaded automatically from the environment variable `VISION_AGENT_API_KEY`.

To use ADE outside this course, you can generate a free API key at [va.landing.ai](https://va.landing.ai).
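
Before creating the client, it can help to confirm that the key is actually present in the environment. The sketch below is a minimal check, not part of the ADE SDK: `api_key_status` is a hypothetical helper, and only the variable name `VISION_AGENT_API_KEY` comes from this lab. It reports whether the key is set without printing its value.

```python
import os
from typing import Mapping, Optional

API_KEY_VAR = "VISION_AGENT_API_KEY"  # variable name taken from the lab text


def api_key_status(env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper: describe whether the key is set, never echo it."""
    env = os.environ if env is None else env
    key = env.get(API_KEY_VAR, "").strip()
    if not key:
        return f"{API_KEY_VAR} is not set; add it to your .env file"
    return f"{API_KEY_VAR} is set ({len(key)} characters)"


# Checks the real environment of the current process
print(api_key_status())
```

Run this after `load_dotenv` so that values from `.env` are already loaded.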

In [None]:
# Initialize the client
client = LandingAIADE()
print("Authenticated client initialized")

<a id="2"></a>

## 2. Helper Functions

Import visualization functions from `helper.py` to display documents and draw bounding boxes around detected chunks.

In [None]:
from helper import print_document, draw_bounding_boxes, draw_bounding_boxes_2
from helper import create_cropped_chunk_images

<a id="3"></a>

## 3. Exercise 1: Extract Key-Value Pairs from a Utility Bill

Parse a utility bill and extract specific fields like current charges, gas usage, and electric usage. The workflow:
1. Preview the document
2. Parse with DPT-2 to get structured markdown and chunks
3. Extract key-value pairs using a JSON schema

<p style="background-color:#f7fff8; padding:15px; border-width:3px; border-color:#e0f0e0; border-style:solid; border-radius:6px"> 🚨
&nbsp; <b>Different Run Results:</b> LandingAI continues to innovate with DPT-2. Considering updates to the model, your results might differ slightly from those shown in the video.</p>

<a id="3-1"></a>

### 3.1 Preview the Document

A combined gas and electric bill from a San Diego utility with monthly billing period, separate charges, and usage history charts.

In [None]:
print_document("utility_example/utility_bill.pdf")

<a id="3-2"></a>

### 3.2 Parse the Document

The Parse API converts the document into structured markdown with:
- Chunks: Semantic regions (text, tables, figures, logos, marginalia)
- Bounding boxes: Coordinates for each chunk
- Markdown: Text representation with embedded chunk IDs

Using `dpt-2-latest` provides the most current version of the DPT-2 model.

In [None]:
# Specify the file path to the document
document_path = Path("utility_example/utility_bill.pdf")

print("⚡ Calling API to parse document...")

# Parse the document using the Parse API
parse_result: ParseResponse = client.parse(
    document=document_path,
    model="dpt-2-latest"
)

print("Parsing completed.")
print(f"job_id: {parse_result.metadata.job_id}")
print(f"Filename: {parse_result.metadata.filename}")
print(f"Total time (ms): {parse_result.metadata.duration_ms}")
print(f"Total pages: {len(parse_result.splits)}")
print(f"Total markdown characters: {len(parse_result.markdown)}")
print(f"Total chunks: {len(parse_result.chunks)}")

<a id="3-3"></a>

### 3.3 Explore the Output from Parse

The parse result contains the document structure. Visualize the detected chunks with bounding boxes.

In [None]:
# Create and view an annotated version
draw_bounding_boxes(parse_result, document_path)

Each chunk has a unique ID, type, page number, and bounding box coordinates. Chunk types include: `logo`, `text`, `table`, `figure`, and `marginalia`.

In [None]:
# Inspect the first 5 chunks
parse_result.chunks[0:5]

In [None]:
print(f"The first chunk has an id: {parse_result.chunks[0].id}")
print(f"The first chunk is type: {parse_result.chunks[0].type}")
print(f"The first chunk is on page: {parse_result.chunks[0].grounding.page}")
print(f"The first chunk is at box coordinates: {parse_result.chunks[0].grounding.box}")
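
To draw or crop a chunk yourself, the bounding box has to be mapped onto the page image. The sketch below assumes the box is expressed as fractions of the page size with keys `left`, `top`, `right`, and `bottom`; that layout is an assumption, so compare it with the box printed above before relying on it. `box_to_pixels` is a hypothetical helper and the sample values are made up.

```python
from typing import Dict, Tuple


def box_to_pixels(box: Dict[str, float], width: int, height: int) -> Tuple[int, int, int, int]:
    """Scale a normalized box (assumed keys: left, top, right, bottom) to pixels."""
    return (
        round(box["left"] * width),
        round(box["top"] * height),
        round(box["right"] * width),
        round(box["bottom"] * height),
    )


# Made-up box covering the upper-left area of a 1000 x 2000 pixel page
sample_box = {"left": 0.125, "top": 0.25, "right": 0.5, "bottom": 0.75}
print(box_to_pixels(sample_box, width=1000, height=2000))  # (125, 500, 500, 1500)
```

The resulting tuple is in the `(left, upper, right, lower)` order that `PIL.Image.crop` expects.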

In [None]:
# How many chunks of each type?
counts = {}

for chunk in parse_result.model_dump()["chunks"]:
    t = chunk["type"]
    counts[t] = counts.get(t, 0) + 1

print(counts)
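
The same tally can be written more compactly with `collections.Counter`. This is only an alternative sketch: `sample_chunks` is made-up data standing in for the list of chunk dictionaries returned by `parse_result.model_dump()["chunks"]`.

```python
from collections import Counter

# Made-up stand-in for parse_result.model_dump()["chunks"]
sample_chunks = [
    {"type": "logo"},
    {"type": "text"},
    {"type": "table"},
    {"type": "text"},
]

type_counts = Counter(chunk["type"] for chunk in sample_chunks)

# most_common() sorts by count, so the dominant chunk type comes first
print(type_counts.most_common(1))  # [('text', 2)]
```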

The top-level markdown contains the full document content with embedded chunk IDs. These IDs enable visual grounding, that is, tracing each extracted value back to its source location in the document.

In [None]:
print("TOP-LEVEL MARKDOWN CONTENTS")
print(f"{parse_result.markdown[0:500]}")

Tables are rendered as HTML with unique IDs for each cell. This enables extraction to reference specific table cells.
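
To see which IDs a piece of markdown carries, you can collect every `id` attribute with a regular expression. This is a rough sketch under an assumption: it supposes the IDs appear as HTML `id="..."` or `id='...'` attributes, which you should confirm against the markdown printed above. `find_ids` is a hypothetical helper and `sample_markdown` is made up.

```python
import re
from typing import List

# Matches id="..." or id='...' and captures the value
ID_PATTERN = re.compile(r"""\bid=["']([^"']+)["']""")


def find_ids(markdown: str) -> List[str]:
    """Return every id attribute value in document order."""
    return ID_PATTERN.findall(markdown)


# Made-up markdown: one chunk anchor followed by a small table
sample_markdown = (
    "<a id='123e4567-e89b-12d3-a456-426614174000'></a>\n\n"
    "Account Summary\n\n"
    '<table id="0-1"><tr><td id="0-a">Total</td><td id="0-b">120.50</td></tr></table>'
)

for found_id in find_ids(sample_markdown):
    print(found_id)
```

A regular expression is enough for a quick inventory; for anything that needs the cell contents as well, an HTML parser is the more robust choice.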

In [None]:
# Chunk-level markdown rendered
display(HTML(parse_result.chunks[9].markdown))

In [None]:
print(" ")
print("CHUNK-LEVEL MARKDOWN CONTENTS")
print(f"{parse_result.chunks[9].markdown}")
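
The index `9` above is specific to this document and model version, so it may not point at a table in your run. A more robust approach is to look the chunk up by type. The sketch below works on plain dictionaries such as those from `parse_result.model_dump()["chunks"]`; `first_index_of_type` is a hypothetical helper and `sample_chunks` is made up.

```python
from typing import Any, Dict, List, Optional


def first_index_of_type(chunks: List[Dict[str, Any]], chunk_type: str) -> Optional[int]:
    """Return the index of the first chunk with the given type, or None."""
    for index, chunk in enumerate(chunks):
        if chunk.get("type") == chunk_type:
            return index
    return None


# Made-up stand-in for parse_result.model_dump()["chunks"]
sample_chunks = [
    {"type": "logo", "markdown": "Logo"},
    {"type": "text", "markdown": "Account Summary"},
    {"type": "table", "markdown": "<table><tr><td>Total</td></tr></table>"},
]

table_index = first_index_of_type(sample_chunks, "table")
print(table_index)  # 2
```

With a real parse result, the returned index can replace the hard-coded `9`, after checking that it is not `None`.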

<a id="3-4"></a>

### 3.4 Extract Key-Value Pairs from the Document

Define a JSON schema specifying the fields to extract. The schema supports:
- Nested objects: Group related fields (e.g., `account_summary` with sub-fields)
- Multiple types: `number`, `string`, `boolean`
- Rich descriptions: Guide the extraction model

More descriptive field definitions improve extraction accuracy.

In [None]:
schema_dict = {
    "type": "object",
    "title": "Utility Bill Field Extraction Schema",
    "properties": {
    "account_summary": {
      "type": "object",
      "title": "Account Summary",
      "properties": {
        "current_charges": {
          "type": "number",
          "description": "The charges incurred during the current billing "
            "period."
        },
        "total_amount_due": {
          "type": "number",
          "description": "The total amount currently due."
        }
      }
    },
    "gas_summary": {
      "type": "object",
      "title": "Gas Usage Summary",
      "properties": {
        "total_therms_used": {
          "type": "number",
          "description": "Total therms of gas used in the billing period."
        },
        "gas_current_charges": {
          "type": "number",
          "description": "The gas charges incurred during the current "
            "billing period."
        },
        "gas_usage_chart": {
          "type": "boolean",
          "description": "Does the document contain a chart of historical "
            "gas usage?"
        },
        "gas_max_month": {
          "type": "string",
          "description": "Which month has the highest historical gas usage? "
            "Return month name only."
        }
      }
    },
    "electric_summary": {
      "type": "object",
      "title": "Electric Usage Summary",
      "properties": {
        "total_kwh_used": {
          "type": "number",
          "description": "Total kilowatt hours of electricity used in the "
            "billing period."
        },
        "electric_current_charges": {
          "type": "number",
          "description": "The electric charges incurred during the current "
            "billing period."
        },
        "electric_usage_chart": {
          "type": "boolean",
          "description": "Does the document contain a chart of historical "
            "electric usage?"
        },
        "electric_max_month": {
          "type": "string",
          "description": "Which month has the highest historical electric "
            "usage? Return month name only."
        }
      }
    }
  }
}

# Convert the dictionary into a JSON-formatted string
schema_json = json.dumps(schema_dict)
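
Because descriptions guide the extraction model, it is worth checking that every field has one before calling the API. The sketch below walks a nested schema and lists each leaf field with its type and whether a description is present. `iter_leaf_fields` is a hypothetical helper, and `sample_schema` is a shortened, made-up schema in the same shape as `schema_dict`.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_leaf_fields(schema: Dict[str, Any], prefix: str = "") -> Iterator[Tuple[str, str, bool]]:
    """Yield (dotted path, type, has_description) for every non-object field."""
    for name, spec in schema.get("properties", {}).items():
        path = f"{prefix}.{name}" if prefix else name
        if spec.get("type") == "object":
            # Recurse into nested objects such as account_summary
            yield from iter_leaf_fields(spec, path)
        else:
            yield path, spec.get("type", "unknown"), bool(spec.get("description"))


# Shortened, made-up schema; the second field deliberately lacks a description
sample_schema = {
    "type": "object",
    "properties": {
        "account_summary": {
            "type": "object",
            "properties": {
                "current_charges": {"type": "number", "description": "Charges this period."},
                "total_amount_due": {"type": "number"},
            },
        },
        "gas_max_month": {"type": "string", "description": "Month with highest gas usage."},
    },
}

for path, field_type, has_description in iter_leaf_fields(sample_schema):
    print(f"{path}: {field_type}, description present: {has_description}")
```

Passing `schema_dict` instead of `sample_schema` lists the twelve leaf fields defined above.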

Call the Extract API with the schema and markdown from the parse step. The Extract API uses the structured markdown from the Parse API to find the requested fields in the text. 

In [None]:
print("⚡ Calling API to extract from the document...")

# Use the Extract API to pull structured data that matches the schema
extraction_result: ExtractResponse = client.extract(
    schema=schema_json,
    # The input is the top-level markdown from the parse step
    markdown=parse_result.markdown,
    model="extract-latest"
)

print("Extraction completed.")

<a id="3-5"></a>

### 3.5 Explore the Output from Extract

The extraction result contains:
- extraction: The extracted values matching your schema
- extraction_metadata: References to source chunk/cell IDs for each value

In [None]:
# View all extracted values
extraction_result.extraction
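
Nested results are easier to review once flattened into dotted field names, which also makes unfilled fields easy to spot. The sketch below assumes the extraction is a nested dictionary mirroring the schema; `flatten` is a hypothetical helper and the values in `sample_extraction` are invented placeholders, not results from the utility bill.

```python
from typing import Any, Dict


def flatten(nested: Dict[str, Any], parent: str = "") -> Dict[str, Any]:
    """Turn {"a": {"b": 1}} into {"a.b": 1}."""
    flat: Dict[str, Any] = {}
    for key, value in nested.items():
        path = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Invented placeholder values in the shape of the schema above
sample_extraction = {
    "account_summary": {"current_charges": 100.0, "total_amount_due": 100.0},
    "gas_summary": {"total_therms_used": None, "gas_usage_chart": True},
}

flat_fields = flatten(sample_extraction)
missing_fields = [path for path, value in flat_fields.items() if value is None]

print(flat_fields)
print("Missing:", missing_fields)
```

The flattened dictionary can also be passed to `pd.Series` for a tabular view.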

The metadata provides visual grounding. Short IDs like `0-a` refer to table cells, while longer UUIDs refer to figure or text chunks. This enables verification UIs that highlight the exact source of each extracted value.

In [None]:
# View all metadata for extracted values
extraction_result.extraction_metadata
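
The distinction between short cell IDs and UUIDs can be checked programmatically. The sketch below classifies a reference string by whether it parses as a UUID. `is_uuid` and `reference_kind` are hypothetical helpers and the sample references are made up; how references are nested inside `extraction_metadata` is not assumed here.

```python
import uuid


def is_uuid(reference: str) -> bool:
    """Return True if the string parses as a UUID."""
    try:
        uuid.UUID(reference)
    except ValueError:
        return False
    return True


def reference_kind(reference: str) -> str:
    """Label a reference following the rule described above."""
    return "chunk" if is_uuid(reference) else "table cell"


# Made-up references: one short cell ID and one UUID
for reference in ["0-a", "123e4567-e89b-12d3-a456-426614174000"]:
    print(reference, "->", reference_kind(reference))
```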

<a id="4"></a>

## 4. Exercise 2: ADE Performance on Difficult-to-Parse Documents

See how ADE handles document types that are challenging for traditional OCR and VLM approaches:
- Charts and flowcharts with complex spatial relationships
- Tables with missing gridlines and merged cells
- Handwritten forms with checkboxes and circles
- Illustrations without text
- Official stamps and signatures

The same Parse API handles all of these without additional configuration. Only the illustration example in section 4.5 switches to a different DPT model version.

In [None]:
def parse_document(parse_filename: str, model: str = "dpt-2-latest",
                   display_option: str = "HTML") -> None:
    """
    Parse a document with ADE and display the result in the desired format.

    Args:
        parse_filename: Path to the document to parse.
        model: Name of the DPT model to use (default: "dpt-2-latest").
        display_option: One of:
            - "Raw Markdown" : print the markdown as plain text
            - "HTML"         : render the markdown as HTML in the notebook

    Returns:
        None. The parse summary and markdown are displayed in the notebook.
    """

    document_path = Path(parse_filename)
    
    print("⚡ Calling API to parse document...")
    
    full_parse_result: ParseResponse = client.parse(  
        #send document to Parse API
        document=document_path, 
        model=model
    )

    _ = draw_bounding_boxes(full_parse_result, document_path=document_path)

    print(f"Parsing completed.")
    print(f"job_id: {full_parse_result.metadata.job_id}")
    print(f"Total pages: {len(full_parse_result.splits)}")
    print(f"Total time (ms): {full_parse_result.metadata.duration_ms}")
    print(f"Total markdown characters: {len(full_parse_result.markdown)}")
    print(f"Number of chunks: {len(full_parse_result.chunks)}")
    print(f" ")
    print("Complete Markdown:")

    if display_option == "Raw Markdown":
        print("Complete Markdown (raw):")
        print(full_parse_result.markdown)

    elif display_option == "HTML":
        print("Rendering markdown as HTML...")
        display(HTML(full_parse_result.markdown))

    else:
        print(
            f"[Unknown display_option '{display_option}'; "
            "valid options are 'Raw Markdown' or 'HTML'. "
            "Defaulting to HTML.]"
        )
        display(HTML(full_parse_result.markdown))

<a id="4-1"></a>

### 4.1 Charts and Flowcharts

Charts encode meaning through bars, lines, and spatial relationships. Flowcharts use arrows to show connections that don't follow standard reading order.

In [None]:
print_document("difficult_examples/Investor_Presentation_pg7.png")

In [None]:
parse_document("difficult_examples/Investor_Presentation_pg7.png", 
               display_option = "Raw Markdown")

This HR flowchart has arrows pointing in multiple directions. Traditional OCR would fail because "Select candidate" appears *above* "Good reference" on the page, but logically follows it in the workflow.

In [None]:
print_document("difficult_examples/hr_process_flowchart.png")

In [None]:
parse_document("difficult_examples/hr_process_flowchart.png", 
               display_option = "HTML")

<a id="4-2"></a>

### 4.2 Tables with Many Cells, Missing Gridlines, and Merged Cells

Real-world tables often lack clear gridlines, have merged header cells, or contain blank cells. ADE handles these by understanding the visual structure.

In [None]:
print_document("difficult_examples/virology_pg2.pdf")

In [None]:
parse_document("difficult_examples/virology_pg2.pdf", 
               display_option = "HTML")

This "mega table" has over 1,000 cells with merged rows and columns. LLMs typically hallucinate on large tables because they can't hold all the numbers in context. The agentic approach processes tables visually, avoiding this limitation.

In [None]:
print_document("difficult_examples/sales_volume.png")

In [None]:
parse_document("difficult_examples/sales_volume.png", 
               display_option = "HTML")
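
Since parsed tables come back as HTML, one quick sanity check on a large table is to count its cells and see how many are merged. The sketch below uses the standard library `html.parser` and relies on the standard `colspan` and `rowspan` attributes; whether ADE emits merged cells this way is an assumption to verify on real output. `TableStats` is a hypothetical helper and `sample_table` is made up.

```python
from html.parser import HTMLParser


class TableStats(HTMLParser):
    """Count table cells, merged cells, and the grid slots they cover."""

    def __init__(self) -> None:
        super().__init__()
        self.cells = 0
        self.merged_cells = 0
        self.grid_slots = 0

    def handle_starttag(self, tag, attrs):
        if tag not in ("td", "th"):
            return
        attributes = dict(attrs)
        colspan = int(attributes.get("colspan") or 1)
        rowspan = int(attributes.get("rowspan") or 1)
        self.cells += 1
        self.grid_slots += colspan * rowspan
        if colspan > 1 or rowspan > 1:
            self.merged_cells += 1


# Made-up 3 x 3 table with one merged header across columns and one across rows
sample_table = (
    "<table>"
    "<tr><th colspan='2'>Region</th><th rowspan='2'>Total</th></tr>"
    "<tr><td>North</td><td>South</td></tr>"
    "<tr><td>10</td><td>20</td><td>30</td></tr>"
    "</table>"
)

stats = TableStats()
stats.feed(sample_table)
print(stats.cells, stats.merged_cells, stats.grid_slots)  # 7 2 9
```

If `grid_slots` equals rows times columns, the merged cells account for every position in the grid, which is a cheap signal that no cells were dropped.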

<a id="4-3"></a>

### 4.3 Handwritten Form with Checkboxes and Circles

This patient intake form combines handwriting, checkboxes, and circled selections. ADE detects these interaction patterns and converts them to structured markdown.

In [None]:
print_document("difficult_examples/patient_intake.pdf")

In [None]:
parse_document("difficult_examples/patient_intake.pdf", 
               display_option = "Raw Markdown")

<a id="4-4"></a>

### 4.4 Handwritten Calculus Answer Sheet

Mathematical notation like integrals, square roots, and fractions requires understanding visual symbols. ADE can parse handwritten math into structured representations.

In [None]:
print_document("difficult_examples/calculus_BC_answer_sheet.jpg")

In [None]:
parse_document("difficult_examples/calculus_BC_answer_sheet.jpg", display_option = "HTML")

<a id="4-5"></a>

### 4.5 Illustrations and Infographics

Some documents contain no text, only illustrations. For these, use `dpt-1-latest`, which provides more detailed figure descriptions.

In [None]:
print_document("difficult_examples/ikea-assembly.pdf")

In [None]:
parse_document("difficult_examples/ikea-assembly.pdf", 
               model = "dpt-1-latest", 
               display_option = "Raw Markdown")

In [None]:
print_document("difficult_examples/ikea_infographic.jpg")

In [None]:
parse_document("difficult_examples/ikea_infographic.jpg", 
               display_option = "Raw Markdown")

<a id="4-6"></a>

### 4.6 Certificate of Origin with Stamps and Signatures

Official documents often contain stamps with curved text and handwritten signatures. ADE detects these as `attestation` chunk types and extracts their content.

In [None]:
print_document("difficult_examples/certificate_of_origin.pdf")

In [None]:
parse_document("difficult_examples/certificate_of_origin.pdf", 
               display_option = "HTML")

<a id="5"></a>

## 5. Summary

Here's what we learned about LandingAI's ADE framework:

| Concept | Description |
|---------|-------------|
| **Parse API** | Converts documents to structured markdown with chunks, bounding boxes, and unique IDs |
| **Extract API** | Pulls key-value pairs using JSON schemas with references for visual grounding |
| **DPT Models** | Document Pre-trained Transformers (DPT-2) that understand documents visually |
| **Chunk Types** | `logo`, `text`, `table`, `figure`, `marginalia`, `attestation` |
| **Visual Grounding** | Each extracted value references its source chunk |

In the second part of this lab (Exercise 3), you will build a complete document processing pipeline for loan automation using document categorization and extraction schemas.