# 🧠 Agentic Document Extraction with LandingAI

This notebook demonstrates how to use the `landingai-ade` Python package to extract structured information from documents using LandingAI's Agentic Document Extraction (ADE) service.

We'll walk through:
- Parsing documents with ADE
- Defining a custom schema using `pydantic`
- Viewing structured field extractions
- Saving results to CSV

## 📦 Setup & Imports

Import necessary packages and utility functions. Ensure you have installed the required dependencies:

```bash
pip install landingai-ade python-dotenv pandas
```

Obtain your API Key from the Visual Playground at https://va.landing.ai/settings/api-key

Read about options for setting your API key at https://docs.landing.ai/ade/ade-python

This notebook uses a `.env` file in the same directory to store the API key.

In [20]:
# Standard libraries
import os
import json
from datetime import date
from pathlib import Path
from dotenv import load_dotenv

# Agentic Document Extraction from LandingAI
from landingai_ade import LandingAIADE
from landingai_ade.lib import pydantic_to_json_schema

# Print library version
import landingai_ade
print(f"📦 landingai-ade version: {landingai_ade.__version__}")

📦 landingai-ade version: 1.4.0


In [21]:
# Initialize the ADE client (uses VISION_AGENT_API_KEY environment variable)
# See options at https://docs.landing.ai/ade/agentic-api-key

# Load environment variables from .env file
load_dotenv()

# Initialize the client (it will automatically use VISION_AGENT_API_KEY from environment)
client = LandingAIADE()
print("✅ Authenticated client initialized")

✅ Authenticated client initialized
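
A small check right after `load_dotenv()` can make a missing key obvious before any request is sent. The sketch below is one way to do that; `require_api_key` is a hypothetical helper, not part of `landingai-ade`, and only the variable name `VISION_AGENT_API_KEY` comes from this notebook.

```python
import os
from typing import Mapping, Optional


def require_api_key(
    name: str = "VISION_AGENT_API_KEY",
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the API key stored under `name`, or raise if it is missing or blank.

    `require_api_key` is a hypothetical helper, not part of landingai-ade.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in your shell."
        )
    return key


# Example with an explicit mapping, so the snippet runs without touching real secrets
print(len(require_api_key(env={"VISION_AGENT_API_KEY": "demo-key"})))
```

In the notebook itself, calling `require_api_key()` with no arguments would read from the real environment.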


## 📁 Define Input and Output Directories

Specify where your documents are located and where results will be saved.


In [22]:
# Define input and output directory paths
base_dir = Path(os.getcwd())
input_folder = base_dir / "input_folder"
results_folder = base_dir / "results_folder"

# Create output folders if they don't exist
input_folder.mkdir(parents=True, exist_ok=True)
results_folder.mkdir(parents=True, exist_ok=True)

In [23]:
# Collect all files to be processed
# Check official documentation for all supported filetypes https://docs.landing.ai/ade/ade-file-types

file_paths = [
    p for p in input_folder.iterdir()
    if p.suffix.lower() in [".pdf", ".png", ".jpg", ".jpeg", ".doc", ".docx", ".odt", ".ppt", ".pptx", ".odp"]
]

print(f"📄 Found {len(file_paths)} documents to process")
for i, path in enumerate(file_paths[:10], 1):
    print(f"  {i}. {path.name}")

📄 Found 5 documents to process
  1. CME_Mendez_ex5.png
  2. CME_Mendez_ex4.png
  3. CME_Mendez_ex1.png
  4. CME_Mendez_ex3.png
  5. CME_Mendez_ex2.png
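
The listing above reflects whatever order `iterdir()` yields, which is not guaranteed to be sorted. A minimal sketch of a reusable collector that sorts by file name follows. `collect_documents` and `SUPPORTED_SUFFIXES` are hypothetical names, and the suffix set simply mirrors the list used in this notebook rather than the full list in the official documentation.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List

# Mirrors the suffixes used in this notebook; see the ADE docs for the full list
SUPPORTED_SUFFIXES = frozenset(
    {".pdf", ".png", ".jpg", ".jpeg", ".doc", ".docx", ".odt", ".ppt", ".pptx", ".odp"}
)


def collect_documents(folder: Path, suffixes: Iterable[str] = SUPPORTED_SUFFIXES) -> List[Path]:
    """Return supported files in `folder`, sorted by name for a stable processing order."""
    allowed = {s.lower() for s in suffixes}
    return sorted(
        (p for p in folder.iterdir() if p.is_file() and p.suffix.lower() in allowed),
        key=lambda p: p.name,
    )


# Demo on a temporary folder with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ["b.PNG", "a.pdf", "notes.txt"]:
        (Path(tmp) / name).touch()
    demo_names = [p.name for p in collect_documents(Path(tmp))]

print(demo_names)
```

A stable order makes reruns easier to compare, since saved outputs and log lines appear in the same sequence each time.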


## 📑 Define Custom Schema for Field Extraction

Using `pydantic`, we define a schema to extract specific fields (e.g., recipient name, issuing organization, credits) from the CME certificates.

See https://docs.landing.ai/ade/ade-python#extract%3A-getting-started for more details.

In [28]:
# Import pydantic for schema definition
from pydantic import BaseModel, Field

# Define schema for structured extraction
class CME(BaseModel):
    recipient_name: str = Field(description="Full name of the individual who received the certificate. Only the name. Remove any prefixes such as Mr. Mrs. or Dr. Also remove any credentials that may appear after the name such as BS, MD, DDS, RN")
    issuing_org: str = Field(description="Full name of the organization issuing the certificate.")
    activity_title: str = Field(description="Title of the CME activity or material completed by the recipient.")
    date_awarded: date = Field(description="Date when the certificate or credit was awarded.")
    credit_awarded: str = Field(description="Amount and type of CME credit awarded to the recipient.")
    credit_numeric: float = Field(description="Amount of CME credit awarded.")
    ama_pra_cat1: bool = Field(description="True if the CME credits awarded qualify for AMA PRA Category 1.")
    ama_pra_cat2: bool = Field(description="True if the CME credits awarded qualify for AMA PRA Category 2.")

# Convert Pydantic model to JSON schema
cme_schema = pydantic_to_json_schema(CME)

## 📄 Single Document Example

Let's start with a single document to understand the workflow.

### Two-Step Process: Parse → Extract

**Step 1: Parse**
The `parse()` method converts the document into structured markdown and chunks with grounding information.

**Step 2: Extract**
The `extract()` method applies your custom schema to pull specific fields from the markdown.

### Step 1: Parse a Single Document

In [25]:
from landingai_ade.types import ParseResponse, ExtractResponse

if len(file_paths) > 0:
    # Parse the first document
    single_doc = file_paths[0]
    print(f"🔍 Parsing: {single_doc.name}")

    single_parse_result: ParseResponse = client.parse(
        document=single_doc,
        model="dpt-2-latest"
    )

    # Explore the parse result
    print("✅ Parse complete!")

    print(f"Markdown length: {len(single_parse_result.markdown)} characters")
    print(f"Chunks: {len(single_parse_result.chunks)}")

    print(f"Parsing metadata: {single_parse_result.metadata}")
    print(f"Grounding details: {single_parse_result.grounding}")

    print("\n📝 Markdown preview (first 200 chars):")
    print(single_parse_result.markdown[:200] + "...")

🔍 Parsing: CME_Mendez_ex5.png
✅ Parse complete!
Markdown length: 1090 characters
Chunks: 7
Parsing metadata: ParseMetadata(credit_usage=3.0, duration_ms=3151, filename='CME_Mendez_ex5.png', job_id='f169f3c03db344f7b5f7f8fdde90a836', org_id='u3z0u1hn4acl', page_count=1, version='dpt-2-20251103', failed_pages=[])
Grounding details: {'509bfc0a-6159-4fcb-8630-8101c9be7594': GroundingParseResponseGrounding(box=ParseGroundingBox(bottom=0.13180823624134064, left=0.18170011043548584, right=0.8152531385421753, top=0.0686657726764679), page=0, type='chunkText'), '1f47c7ca-2a8d-4ce5-8eac-a092bc6664da': GroundingParseResponseGrounding(box=ParseGroundingBox(bottom=0.2988176643848419, left=0.3912568986415863, right=0.5997633934020996, top=0.159275621175766), page=0, type='chunkText'), 'e2d834f1-fb00-44e5-b608-8582c4dd8006': GroundingParseResponseGrounding(box=ParseGroundingBox(bottom=0.4080730974674225, left=0.2980690598487854, right=0.6993106603622437, top=0.3120444715023041), page=0, type='ch
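
The box values in the grounding output all fall between 0 and 1, which suggests they are fractions of the page width and height rather than pixel positions. Under that assumption (worth confirming against the ADE documentation), a box could be mapped onto an image of known size as sketched below. `NormalizedBox` and `to_pixels` are hypothetical names that stand in for the library's own types.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class NormalizedBox:
    """Stand-in for a grounding box; assumes each edge is a fraction of the page size."""
    left: float
    top: float
    right: float
    bottom: float


def to_pixels(box: NormalizedBox, width: int, height: int) -> Tuple[int, int, int, int]:
    """Scale a normalized box to (left, top, right, bottom) pixel coordinates."""
    for value in (box.left, box.top, box.right, box.bottom):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"expected a value in [0, 1], got {value}")
    return (
        round(box.left * width),
        round(box.top * height),
        round(box.right * width),
        round(box.bottom * height),
    )


# Made-up box on a 1000 x 800 pixel page image
pixel_box = to_pixels(NormalizedBox(left=0.25, top=0.1, right=0.75, bottom=0.5), 1000, 800)
print(pixel_box)
```

The resulting tuple uses the (left, top, right, bottom) order that many imaging libraries expect for crop or rectangle operations.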

### Step 2: Extract Structured Fields

In [None]:
if len(file_paths) > 0:

    # Extract structured data using the schema
    single_extraction_result: ExtractResponse = client.extract(
        markdown=single_parse_result.markdown,  # send the markdown from the parsing step
        schema=cme_schema
    )

    # View the extracted CME data

    print("✅ Extraction complete!")

    print("\n📦 Extracted fields:")
    print(single_extraction_result.extraction)

    print("\n📦 Extracted field metadata:")
    print(single_extraction_result.extraction_metadata)

    print("\n📦 Extraction process details:")
    print(single_extraction_result.metadata)
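
The extraction comes back as JSON-like data, so a field declared as `date` in the schema would typically arrive as a string. Below is a minimal sketch of turning one extraction into typed Python values. It assumes `extraction` is a plain dictionary keyed by the schema's field names and that dates are ISO formatted (`YYYY-MM-DD`), as they appear in the summary table later in this notebook. `coerce_cme_fields` is a hypothetical helper and the sample values are illustrative.

```python
from datetime import date
from typing import Any, Dict


def coerce_cme_fields(extraction: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with `date_awarded` as a date and `credit_numeric` as a float.

    Missing or null values are left as None instead of raising.
    """
    typed = dict(extraction)

    raw_date = typed.get("date_awarded")
    if isinstance(raw_date, str):
        typed["date_awarded"] = date.fromisoformat(raw_date)  # expects YYYY-MM-DD

    raw_credit = typed.get("credit_numeric")
    if raw_credit is not None:
        typed["credit_numeric"] = float(raw_credit)

    return typed


# Illustrative values shaped like this notebook's CME schema
sample = {"recipient_name": "Jane Doe", "date_awarded": "2022-10-18", "credit_numeric": "0.25"}
typed_sample = coerce_cme_fields(sample)
print(typed_sample["date_awarded"].year, typed_sample["credit_numeric"])
```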

## 🚀 Run ADE Parse + Extract for All Input Files

Parse and extract every document in the input folder, saving three outputs per document:
- **Parse JSON** (`{filename}_parse.json`): Full parse response with markdown, chunks, grounding, and metadata
- **Markdown** (`{filename}.md`): Just the extracted text content
- **Extract JSON** (`{filename}_extract.json`): Structured extraction results with field metadata

Each output file is named after the input file for easy reference.
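
The loop below derives output names from the input file's stem (its name without the extension), so two inputs such as `scan.pdf` and `scan.png` would write to the same output files. A minimal sketch of a pre-flight check for that case follows; `find_stem_collisions` is a hypothetical helper and the file names are made up.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List


def find_stem_collisions(paths: Iterable[Path]) -> Dict[str, List[str]]:
    """Map each stem shared by more than one input file to the colliding file names."""
    by_stem: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        by_stem[path.stem].append(path.name)
    return {stem: sorted(names) for stem, names in by_stem.items() if len(names) > 1}


# Made-up inputs: the two "scan" files would both write scan_parse.json, scan.md, ...
demo_inputs = [Path("scan.pdf"), Path("scan.png"), Path("invoice.pdf")]
collisions = find_stem_collisions(demo_inputs)
print(collisions)
```

If the check returns anything, renaming the inputs or including the extension in the output name would avoid overwriting results.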

In [33]:

# Dictionary to store the parse and extraction results for each document
results = {}

# Process each document collected earlier in `file_paths`
for input_file in file_paths:
        
    doc_name = input_file.stem
    print(f"Processing document: {input_file.name}")
    
    # Step 1: Parse the document to extract layout and content
    parse_result: ParseResponse = client.parse(
        document=input_file,
        model="dpt-2-latest"
    )
    print("  ✅ Parsing completed.")
    
    # Save parse results
    parse_json_path = results_folder / f"{doc_name}_parse.json"
    markdown_path = results_folder / f"{doc_name}.md"
    
    with open(parse_json_path, 'w', encoding='utf-8') as f:
        json.dump(parse_result.model_dump(), f, indent=2, ensure_ascii=False, default=str)
    
    with open(markdown_path, 'w', encoding='utf-8') as f:
        f.write(parse_result.markdown)
    
    print("  💾 Saved parse JSON and markdown")

    # Step 2: Extract structured fields using the schema defined earlier
    print("  🎯 Running extraction...")
    extraction_result: ExtractResponse = client.extract(
        schema=cme_schema,
        markdown=parse_result.markdown
    )
    print("  ✅ Extraction completed.")
    
    # Save extraction results
    extract_json_path = results_folder / f"{doc_name}_extract.json"
    with open(extract_json_path, 'w', encoding='utf-8') as f:
        json.dump(extraction_result.model_dump(), f, indent=2, ensure_ascii=False, default=str)
    
    print("  💾 Saved extraction JSON\n")

    # Store in results dictionary. This will be used later to create a summary dataframe
    results[doc_name] = {
        "parse_result": parse_result,
        "extraction_result": extraction_result
    }

print(f"✅ Processed {len(results)} documents")

Processing document: CME_Mendez_ex5.png
  ✅ Parsing completed.
  💾 Saved parse JSON and markdown
  🎯 Running extraction...
  ✅ Extraction completed.
  💾 Saved extraction JSON

Processing document: CME_Mendez_ex4.png
  ✅ Parsing completed.
  💾 Saved parse JSON and markdown
  🎯 Running extraction...
  ✅ Extraction completed.
  💾 Saved extraction JSON

Processing document: CME_Mendez_ex1.png
  ✅ Parsing completed.
  💾 Saved parse JSON and markdown
  🎯 Running extraction...
  ✅ Extraction completed.
  💾 Saved extraction JSON

Processing document: CME_Mendez_ex3.png
  ✅ Parsing completed.
  💾 Saved parse JSON and markdown
  🎯 Running extraction...
  ✅ Extraction completed.
  💾 Saved extraction JSON

Processing document: CME_Mendez_ex2.png
  ✅ Parsing completed.
  💾 Saved parse JSON and markdown
  🎯 Running extraction...
  ✅ Extraction completed.
  💾 Saved extraction JSON

✅ Processed 5 documents
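
The loop above stops at the first exception, so one unreadable file or failed request would leave the remaining documents unprocessed. Below is a minimal sketch of a wrapper that records failures and keeps going. `run_batch` and `process_one` are hypothetical names; in this notebook `process_one` would wrap the `client.parse` and `client.extract` calls, while the demo uses a stand-in function so it runs offline.

```python
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, Tuple


def run_batch(
    paths: Iterable[Path],
    process_one: Callable[[Path], Any],
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Apply `process_one` to each path; return (results, errors), both keyed by file stem."""
    results: Dict[str, Any] = {}
    errors: Dict[str, str] = {}
    for path in paths:
        try:
            results[path.stem] = process_one(path)
        except Exception as exc:  # keep going; record what went wrong
            errors[path.stem] = f"{type(exc).__name__}: {exc}"
    return results, errors


def fake_process(path: Path) -> str:
    """Stand-in for parse + extract; fails on one made-up file name."""
    if path.stem == "corrupt":
        raise ValueError("could not read file")
    return f"processed {path.name}"


ok, failed = run_batch([Path("a.png"), Path("corrupt.png"), Path("b.png")], fake_process)
print(ok)
print(failed)
```

Keeping the errors in a separate dictionary makes it straightforward to retry only the failed documents later.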


In [34]:
results

{'CME_Mendez_ex5': {'parse_result': ParseResponse(chunks=[Chunk(id='23fd5471-0717-4ab5-b701-b7414298066a', grounding=ChunkGrounding(box=ParseGroundingBox(bottom=0.13180823624134064, left=0.18170011043548584, right=0.8152531385421753, top=0.0686657726764679), page=0), markdown="<a id='23fd5471-0717-4ab5-b701-b7414298066a'></a>\n\n***CONTINUING MEDICAL EDUCATION CERTIFICATE***", type='text'), Chunk(id='332c09ab-40b2-4266-a4f4-2880d3c6c95a', grounding=ChunkGrounding(box=ParseGroundingBox(bottom=0.2988176643848419, left=0.3912568986415863, right=0.5997633934020996, top=0.159275621175766), page=0), markdown="<a id='332c09ab-40b2-4266-a4f4-2880d3c6c95a'></a>\n\nMedscape\ncertifies that", type='text'), Chunk(id='891aab28-5f0c-4574-be29-34952a9e785d', grounding=ChunkGrounding(box=ParseGroundingBox(bottom=0.4080730974674225, left=0.2980690598487854, right=0.6993106603622437, top=0.3120444715023041), page=0), markdown="<a id='891aab28-5f0c-4574-be29-34952a9e785d'></a>\n\nManoel Cortes Mendez\nha

## 📊 Define Helper Functions

Helper functions to flatten nested dictionaries and create a summary DataFrame from extraction results.

In [41]:
# Helper functions that flatten arbitrarily nested dicts and lists into flat, DataFrame-friendly key/value pairs.

import pandas as pd
from typing import Any, Dict, List, Tuple

def flatten_dict(
    data: Dict[str, Any],
    parent_key: str = "",
    sep: str = "_"
) -> Dict[str, Any]:
    items = {}
    for k, v in data.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k

        if isinstance(v, dict):
            items.update(flatten_dict(v, new_key, sep))
        elif isinstance(v, list):
            items[new_key] = str(v)  # lists → string for DataFrame safety
        else:
            items[new_key] = v

    return items


def create_summary_dataframe(
    extraction_results: List[Tuple[Any, Any, str]]
) -> pd.DataFrame:
    records = []

    for _, extract_result, doc_name in extraction_results:
        extraction = extract_result.extraction or {}
        metadata = extract_result.extraction_metadata or {}

        # Flatten extraction fields
        flat_extraction = flatten_dict(extraction)

        record = {
            "document_name": doc_name,
            **flat_extraction,
        }

        # Attach metadata references generically
        for field, meta in metadata.items():
            refs = meta.get("references") if isinstance(meta, dict) else None
            if refs is not None:
                record[f"{field}_chunks"] = str(refs)

        records.append(record)

    return pd.DataFrame(records)
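
To see what the flattening does, the sketch below repeats `flatten_dict` so it runs on its own and applies it to a made-up nested record. The CME schema in this notebook is flat, so the nested `provider` field is purely illustrative.

```python
from typing import Any, Dict


def flatten_dict(data: Dict[str, Any], parent_key: str = "", sep: str = "_") -> Dict[str, Any]:
    """Same logic as the helper above: nested keys are joined with `sep`, lists become strings."""
    items: Dict[str, Any] = {}
    for k, v in data.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            items.update(flatten_dict(v, new_key, sep))
        elif isinstance(v, list):
            items[new_key] = str(v)
        else:
            items[new_key] = v
    return items


# Made-up nested record
nested = {
    "recipient_name": "Jane Doe",
    "provider": {"name": "Example University", "address": {"city": "Austin"}},
    "credit_types": ["cat1", "cat2"],
}
flat = flatten_dict(nested)
print(flat)
```

One thing to watch: because keys are joined with `_`, a top-level key such as `a_b` and a nested `a` → `b` would map to the same column, and the later one would overwrite the earlier. Passing a different `sep` (for example `"."`) is one way to reduce that risk.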


## 💾 Convert to Table and Save

Convert the extracted fields to a pandas DataFrame and save it to the results folder created earlier.

In [43]:
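print("\n📊 Creating summary DataFrame...")

# Build (parse_result, extraction_result, doc_name) tuples from the `results` dict filled in by the batch loop
extraction_results = [
    (r["parse_result"], r["extraction_result"], name) for name, r in results.items()
]
df = create_summary_dataframe(extraction_results)

df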


📊 Creating summary DataFrame...


Unnamed: 0,document_name,recipient_name,issuing_org,activity_title,date_awarded,credit_awarded,credit_numeric,ama_pra_cat1,ama_pra_cat2,recipient_name_chunks,issuing_org_chunks,activity_title_chunks,date_awarded_chunks,credit_awarded_chunks,credit_numeric_chunks,ama_pra_cat1_chunks,ama_pra_cat2_chunks
0,CME_Mendez_ex5,Manoel Cortes Mendez,"Medscape, LLC",Patient Case: To Screen or Not to Screen for C...,2022-10-18,0.25 AMA PRA Category 1 Credit(s)™,0.25,True,,['891aab28-5f0c-4574-be29-34952a9e785d'],"['332c09ab-40b2-4266-a4f4-2880d3c6c95a', '9d67...",['5d546a77-822b-4c1a-a327-5fcd5630ee70'],['6974c665-2168-489a-946d-075c6c2dccd8'],['6974c665-2168-489a-946d-075c6c2dccd8'],"['6974c665-2168-489a-946d-075c6c2dccd8', '9d67...","['6974c665-2168-489a-946d-075c6c2dccd8', '9d67...",[]
1,CME_Mendez_ex4,Manoel Cortes Mendez,The University of Texas MD Anderson Cancer Center,Cancer Survivorship Series: Module 1 - Overvie...,2022-10-18,0.75 AMA PRA Category 1 Credit(s)™,0.75,True,,['0df7409d-2070-40d4-b216-cce320b3df17'],['853255b3-7ffe-44f9-8fe1-0a21b2df0fb3'],['6f049e31-2054-4816-8e8b-077ec3a46574'],['6f049e31-2054-4816-8e8b-077ec3a46574'],['5be2a9fc-2bf6-43a2-9c9e-bacf8cef989d'],['5be2a9fc-2bf6-43a2-9c9e-bacf8cef989d'],"['5be2a9fc-2bf6-43a2-9c9e-bacf8cef989d', '397e...",[]
2,CME_Mendez_ex1,Manoel Cortes Mendez,The Warren Alpert Medical School of Brown Univ...,"Fears, Bias and Discrimination - Substance Use...",2022-08-19,1.00 AMA PRA Category 1 Credits™,1.0,True,,['768937c9-ed87-4c4c-9696-6086564077d4'],"['4d4a7cf6-e3bf-4ce6-8c14-bb63ead659b7', '553f...",['58f86398-f8cd-4001-834b-7c6110d233fa'],['f56ec3a7-513b-4b3c-afbe-5e54f59c5733'],['2b4e046e-7165-40c1-97f5-ffafd63627da'],['2b4e046e-7165-40c1-97f5-ffafd63627da'],['2b4e046e-7165-40c1-97f5-ffafd63627da'],[]
3,CME_Mendez_ex3,Manoel Cortes Mendez,Johns Hopkins University School of Medicine,Pain Medicine Management - Pain Management of ...,2022-08-18,1.00 AMA PRA Category 1 Credit(s)™,1.0,True,,['ee93e7ed-4eb5-40ea-89b6-98e8ef2cd09b'],['1d8245ab-9f31-406c-ac1b-2467584b38b8'],['ee93e7ed-4eb5-40ea-89b6-98e8ef2cd09b'],['ee93e7ed-4eb5-40ea-89b6-98e8ef2cd09b'],['ee93e7ed-4eb5-40ea-89b6-98e8ef2cd09b'],['ee93e7ed-4eb5-40ea-89b6-98e8ef2cd09b'],['ee93e7ed-4eb5-40ea-89b6-98e8ef2cd09b'],[]
4,CME_Mendez_ex2,Manoel Cortes Mendez,Stanford University School of Medicine,Introduction to Food and Health,2022-06-27,2.50 AMA PRA Category 1 Credit(s),2.5,True,,['3ae64901-9273-431f-8ac2-7cd97251e3e9'],['3ae64901-9273-431f-8ac2-7cd97251e3e9'],['3e8a3adf-3437-40cb-8ee1-afbcd68fdffa'],['e7290719-0e1f-44b7-81b8-1a68a33b9deb'],['f0b5a2b3-4949-42e4-bb3c-d8ef3e466e5a'],['f0b5a2b3-4949-42e4-bb3c-d8ef3e466e5a'],['f0b5a2b3-4949-42e4-bb3c-d8ef3e466e5a'],[]


In [44]:
# Save the DataFrame to a CSV file inside the results_folder
csv_path = results_folder / "cme_output.csv"
df.to_csv(csv_path, index=False)
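
The `*_chunks` columns are written as the string form of a Python list, so they come back from the CSV as text. Below is a minimal sketch of recovering the lists with `ast.literal_eval`, using the standard `csv` module and an in-memory CSV with made-up chunk IDs so it runs without the results folder. `parse_chunk_columns` is a hypothetical helper.

```python
import ast
import csv
import io
from typing import Any, Dict, List


def parse_chunk_columns(rows: List[Dict[str, str]]) -> List[Dict[str, Any]]:
    """Convert every column ending in `_chunks` from a list-like string back into a list."""
    parsed = []
    for row in rows:
        new_row: Dict[str, Any] = dict(row)
        for column, value in row.items():
            if column.endswith("_chunks") and value:
                new_row[column] = ast.literal_eval(value)  # evaluates literals only
        parsed.append(new_row)
    return parsed


# In-memory stand-in for cme_output.csv with made-up chunk IDs
csv_text = (
    "document_name,recipient_name,recipient_name_chunks\n"
    "doc1,Jane Doe,\"['chunk-a', 'chunk-b']\"\n"
    "doc2,John Roe,[]\n"
)
rows = parse_chunk_columns(list(csv.DictReader(io.StringIO(csv_text))))
print(rows[0]["recipient_name_chunks"])
```

The same conversion could be applied to a pandas column after reading the file back, for example by mapping `ast.literal_eval` over each `*_chunks` column.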

## ✅ Wrap-Up

You’ve now used LandingAI’s ADE to:
- Parse and extract data from images or PDFs
- Define custom fields using `pydantic`
- Export structured results to a table

To learn more, visit the [LandingAI Documentation](https://docs.landing.ai/ade/ade-overview).