# 🧠 Agentic Document Extraction with LandingAI

This notebook demonstrates how to use the `agentic_doc` Python package to extract structured information from documents using LandingAI's Agentic Document Extraction (ADE) service.

We'll walk through:
- Parsing documents with ADE
- Defining a custom schema using `pydantic`
- Viewing structured field extractions
- Saving results to CSV

> 📎 Supported formats: `.pdf`, `.png`, `.jpg`, `.jpeg`

## 📦 Setup & Imports

Import necessary packages and utility functions. Ensure you have installed `agentic_doc`, Pillow, and other dependencies:

```bash
pip install agentic-doc pillow
```

Obtain your API Key from the Visual Playground at https://va.landing.ai/settings/api-key

Read about options for setting your API key at https://docs.landing.ai/ade/agentic-api-key

In [None]:
# Install required dependencies:

!pip install agentic-doc pillow



In [None]:
# Add your VISION AGENT API KEY
from getpass import getpass
import os

os.environ["VISION_AGENT_API_KEY"] = getpass("Enter your API key: ")

Enter your API key: ··········
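When the notebook is re-run, or when the key is already exported in the environment, prompting again may be unnecessary. The following is a minimal sketch, assuming the key lives in `VISION_AGENT_API_KEY` as in the cell above. `ensure_api_key` is a hypothetical helper written for this notebook, not part of `agentic_doc`.

```python
import os
from getpass import getpass


def ensure_api_key(env=os.environ, prompt=getpass, name="VISION_AGENT_API_KEY"):
    """Return the API key from `env`, prompting only when it is missing or blank.

    Hypothetical helper: `env` and `prompt` are injectable so the function
    can be exercised without touching the real environment or the terminal.
    """
    key = env.get(name, "").strip()
    if not key:
        key = prompt("Enter your API key: ").strip()
        if not key:
            raise ValueError(f"{name} must not be empty")
        env[name] = key  # store it so later cells can read it
    return key
```

Calling `ensure_api_key()` with no arguments reads and, if needed, fills `os.environ`, so it could stand in for the direct `getpass` assignment.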


In [None]:
# Standard libraries
import os
import json
from datetime import date
from pathlib import Path

# Agentic Document Extraction from LandingAI
from agentic_doc.parse import parse

## 📁 Define Input and Output Directories

Specify where your documents are located and where results will be saved.


In [None]:
# Define input and output directory paths
base_dir = Path(os.getcwd())
input_folder = base_dir / "input_folder"
results_folder = base_dir / "results_folder"
groundings_folder = base_dir / "groundings_folder"

# Create the input and output folders if they don't exist
input_folder.mkdir(parents=True, exist_ok=True)
results_folder.mkdir(parents=True, exist_ok=True)
groundings_folder.mkdir(parents=True, exist_ok=True)

## 🗂️ Collect Document File Paths

This block filters input files for supported formats.


In [None]:
file_names = [
    p.name
    for p in input_folder.iterdir()
    if p.suffix.lower() in [".pdf", ".png", ".jpg", ".jpeg"]
]

file_names

['CBC-test-report-format-example-sample-template-Drlogy-lab-report.pdf']

In [None]:
# Collect all document file paths in input folder with supported extensions
# Convert each Path object to a string to ensure compatibility with parse()

file_paths = [
    str(p)
    for p in input_folder.iterdir()
    if p.suffix.lower() in [".pdf", ".png", ".jpg", ".jpeg"]
]

file_paths

['/content/input_folder/CBC-test-report-format-example-sample-template-Drlogy-lab-report.pdf']
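The two cells above scan the folder twice, and `Path.iterdir()` does not guarantee any particular order. Since later cells pair `file_names[i]` with the i-th parse result, it may be safer to scan once and derive both lists from the same sorted sequence. A minimal sketch, where `collect_documents` and `SUPPORTED_SUFFIXES` are hypothetical names introduced here:

```python
from pathlib import Path

# Same extensions as the cells above
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg"}


def collect_documents(folder):
    """Return supported files in `folder` as a name-sorted list of Path objects."""
    return sorted(
        (
            p
            for p in Path(folder).iterdir()
            if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
        ),
        key=lambda p: p.name,
    )
```

With this helper, `docs = collect_documents(input_folder)` followed by `file_names = [p.name for p in docs]` and `file_paths = [str(p) for p in docs]` keeps both lists in the same order.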

## 🚀 Run Agentic Document Extraction

Call the `parse()` function from `agentic_doc` to extract structured data and save results to the output folders.

See https://docs.landing.ai/ade/ade-parse-docs for details

In [None]:
# Parse documents using LandingAI ADE

result = parse(
    documents=file_paths,
    result_save_dir=str(results_folder),
    grounding_save_dir=str(groundings_folder),
    include_marginalia=True,
    include_metadata_in_markdown=True,
    )

[2m2025-08-19 16:14:15[0m [info   [0m] [1mAPI key is valid.             [0m [[0m[1m[34magentic_doc.utils[0m][0m (utils.py:42)
[2m2025-08-19 16:14:15[0m [info   [0m] [1mParsing 1 documents           [0m [[0m[1m[34magentic_doc.parse[0m][0m (parse.py:348)
[2m2025-08-19 16:14:15[0m [info   [0m] [1mSplitting PDF: '/content/input_folder/CBC-test-report-format-example-sample-template-Drlogy-lab-report.pdf' into 0 parts under '/tmp/tmpzpfp8si8'[0m [[0m[1m[34magentic_doc.utils[0m][0m (utils.py:236)


Parsing documents:   0%|          | 0/1 [00:00<?, ?it/s]

[2m2025-08-19 16:14:15[0m [info   [0m] [1mCreated /tmp/tmpzpfp8si8/CBC-test-report-format-example-sample-template-Drlogy-lab-report_1.pdf[0m [[0m[1m[34magentic_doc.utils[0m][0m (utils.py:252)
[2m2025-08-19 16:14:15[0m [info   [0m] [1mStart parsing document part: 'File name: CBC-test-report-format-example-sample-template-Drlogy-lab-report_1.pdf	Page: [0:0]'[0m [[0m[1m[34magentic_doc.parse[0m][0m (parse.py:670)



Parsing document parts from 'CBC-test-report-format-example-sample-template-Drlogy-lab-report.pdf':   0%|          | 0/1 [00:00<?, ?it/s][A

HTTP Request: POST https://api.va.landing.ai/v1/tools/agentic-document-analysis "HTTP/1.1 200 OK" (_client.py:1025)
[2m2025-08-19 16:14:35[0m [info   [0m] [1mTime taken to successfully parse a document chunk: 19.91 seconds[0m [[0m[1m[34magentic_doc.parse[0m][0m (parse.py:823)
[2m2025-08-19 16:14:35[0m [info   [0m] [1mSuccessfully parsed document part: 'File name: CBC-test-report-format-example-sample-template-Drlogy-lab-report_1.pdf	Page: [0:0]'[0m [[0m[1m[34magentic_doc.parse[0m][0m (parse.py:679)



Parsing document parts from 'CBC-test-report-format-example-sample-template-Drlogy-lab-report.pdf': 100%|██████████| 1/1 [00:19<00:00, 19.91s/it]

[2m2025-08-19 16:14:35[0m [info   [0m] [1mSaving 18 chunks as images to '/content/groundings_folder/CBC-test-report-format-example-sample-template-Drlogy-lab-report_20250819_161435'[0m [[0m[1m[34magentic_doc.utils[0m][0m [36mfile_path[0m=[35mPosixPath('/content/input_folder/CBC-test-report-format-example-sample-template-Drlogy-lab-report.pdf')[0m [36mfile_type[0m=[35mpdf[0m (utils.py:82)
[2m2025-08-19 16:14:35[0m [info   [0m] [1mSaved the parsed result to '/content/results_folder/CBC-test-report-format-example-sample-template-Drlogy-lab-report_20250819_161435.json'[0m [[0m[1m[34magentic_doc.parse[0m][0m (parse.py:467)



Parsing documents: 100%|██████████| 1/1 [00:20<00:00, 20.02s/it]
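The log above reports that the parsed result was written as a timestamped `.json` file under the results folder. To inspect the most recent one without copying its name by hand, something like the following could work. It is a sketch that only assumes the files are ordinary JSON; it makes no claim about the keys inside them, and `load_latest_result` is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_latest_result(results_dir):
    """Load the most recently modified .json file in `results_dir`.

    Returns the parsed JSON content, or None when the folder has no .json files.
    """
    candidates = sorted(
        Path(results_dir).glob("*.json"),
        key=lambda p: p.stat().st_mtime,  # oldest first, newest last
    )
    if not candidates:
        return None
    with candidates[-1].open(encoding="utf-8") as fh:
        return json.load(fh)
```

For example, `latest = load_latest_result(results_folder)` followed by `type(latest)` shows what kind of object was saved before digging into its contents.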


## 📑 Define Custom Schema for Field Extraction

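Using `pydantic`, we define a schema that names the specific fields to extract from the document (e.g., patient name, hemoglobin value). Each field's `description` tells the extraction service what to look for.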

See https://docs.landing.ai/ade/ade-extract-library

In [None]:
# Import pydantic for schema definition
from pydantic import BaseModel, Field
from typing import Optional

# Define schema for structured extraction
class CBCLabReport(BaseModel):
    # Patient Information
    patient_name: str = Field(description="Full name of the patient")
    patient_age: str = Field(description="Age of the patient with units (e.g., '21 Years')")

    # Sample and Report Information
    referring_doctor: str = Field(description="Name of the referring doctor (Ref. By)")
    sample_type: str = Field(description="Primary sample type (e.g., Blood)")

    # Laboratory Information
    lab_name: str = Field(description="Name of the pathology laboratory")
    pathologist_name: str = Field(description="Name and qualification of the pathologist")

    # Hemoglobin
    hemoglobin_value: float = Field(description="Hemoglobin (Hb) value")
    hemoglobin_status: Optional[str] = Field(description="Status if abnormal (Low/High)")

    # RBC Count
    rbc_count_value: float = Field(description="Total RBC count value")
    rbc_count_unit: str = Field(description="Unit for RBC count")
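The type annotations in a schema are not only documentation: `pydantic` validates values against them when a model instance is built. The sketch below illustrates that behavior locally with a hypothetical two-field model (`RBCCount`, reusing two field names from the schema above) and a hypothetical `try_build` helper. It shows `pydantic` validation in general and is not a description of how ADE applies the schema.

```python
from pydantic import BaseModel, Field, ValidationError


class RBCCount(BaseModel):
    # Hypothetical cut-down model, used only to illustrate validation
    rbc_count_value: float = Field(description="Total RBC count value")
    rbc_count_unit: str = Field(description="Unit for RBC count")


def try_build(data):
    """Return (model, None) on success or (None, error_message) on failure."""
    try:
        return RBCCount(**data), None
    except ValidationError as exc:
        return None, str(exc)


# A numeric string is coerced to float in pydantic's default (lax) mode
ok, _ = try_build({"rbc_count_value": "5.2", "rbc_count_unit": "mill/cumm"})

# A non-numeric string cannot be coerced, so validation fails
bad, err = try_build({"rbc_count_value": "five", "rbc_count_unit": "mill/cumm"})
```

Declaring numeric fields as `float` rather than `str` therefore means downstream code can rely on receiving a number.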

## 🚀 Run Agentic Document Extraction with Schema

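Call `parse()` again, this time with the schema, to extract the structured fields. Grounding images are saved to the groundings folder. Because `result_save_dir` is not passed in this call, no result JSON is written, which is why `result_path` prints as `None` later in the notebook.

Pass the schema class as the `extraction_model` argument to `parse()`.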

To learn more about parsing visit [https://docs.landing.ai/ade/ade-parse-docs](https://docs.landing.ai/ade/ade-parse-docs).

In [None]:
# Run ADE using the custom CBCLabReport schema for structured field extraction
result_fe = parse(
    documents=file_paths,
    grounding_save_dir=str(groundings_folder),
    extraction_model=CBCLabReport  # This line is new
    )

[2m2025-08-19 16:24:48[0m [info   [0m] [1mAPI key is valid.             [0m [[0m[1m[34magentic_doc.utils[0m][0m (utils.py:42)
[2m2025-08-19 16:24:48[0m [info   [0m] [1mParsing 1 documents           [0m [[0m[1m[34magentic_doc.parse[0m][0m (parse.py:280)


Parsing documents:   0%|          | 0/1 [00:00<?, ?it/s]

[2m2025-08-19 16:24:48[0m [info   [0m] [1mSplitting PDF: '/content/input_folder/CBC-test-report-format-example-sample-template-Drlogy-lab-report.pdf' into 0 parts under '/tmp/tmpm_plwjl7'[0m [[0m[1m[34magentic_doc.utils[0m][0m (utils.py:236)
[2m2025-08-19 16:24:48[0m [info   [0m] [1mCreated /tmp/tmpm_plwjl7/CBC-test-report-format-example-sample-template-Drlogy-lab-report_1.pdf[0m [[0m[1m[34magentic_doc.utils[0m][0m (utils.py:252)
[2m2025-08-19 16:24:48[0m [info   [0m] [1mStart parsing document part: 'File name: CBC-test-report-format-example-sample-template-Drlogy-lab-report_1.pdf	Page: [0:0]'[0m [[0m[1m[34magentic_doc.parse[0m][0m (parse.py:670)



Parsing document parts from 'CBC-test-report-format-example-sample-template-Drlogy-lab-report.pdf':   0%|          | 0/1 [00:00<?, ?it/s][A

HTTP Request: POST https://api.va.landing.ai/v1/tools/agentic-document-analysis "HTTP/1.1 200 OK" (_client.py:1025)
[2m2025-08-19 16:25:13[0m [info   [0m] [1mTime taken to successfully parse a document chunk: 25.36 seconds[0m [[0m[1m[34magentic_doc.parse[0m][0m (parse.py:823)
[2m2025-08-19 16:25:13[0m [info   [0m] [1mSuccessfully parsed document part: 'File name: CBC-test-report-format-example-sample-template-Drlogy-lab-report_1.pdf	Page: [0:0]'[0m [[0m[1m[34magentic_doc.parse[0m][0m (parse.py:679)



Parsing document parts from 'CBC-test-report-format-example-sample-template-Drlogy-lab-report.pdf': 100%|██████████| 1/1 [00:25<00:00, 25.37s/it]

[2m2025-08-19 16:25:13[0m [info   [0m] [1mSaving 18 chunks as images to '/content/groundings_folder/CBC-test-report-format-example-sample-template-Drlogy-lab-report_20250819_162513'[0m [[0m[1m[34magentic_doc.utils[0m][0m [36mfile_path[0m=[35mPosixPath('/content/input_folder/CBC-test-report-format-example-sample-template-Drlogy-lab-report.pdf')[0m [36mfile_type[0m=[35mpdf[0m (utils.py:82)



Parsing documents: 100%|██████████| 1/1 [00:25<00:00, 25.46s/it]


In [None]:
# View results
result_fe

[ParsedDocument(markdown='Summary : This image is a logo representing a healthcare or medical professional, featuring a stylized person with a stethoscope inside a solid blue circle.\n\nlogo:  \nMain Elements :  \n  • Central figure is a simplified human icon (head and torso) in white.  \n  • A stethoscope is draped around the figure’s neck, with the chest piece visible on the right side.  \n  • The entire design is enclosed within a solid blue circular background.  \n\nDesign Details :  \n  • The logo uses only two colours: white for the figure and stethoscope, blue for the background.  \n  • The stethoscope is outlined in blue, matching the background.  \n  • No text, company name, or tagline is present.  \n  • The figure is centered within the circle, with symmetrical placement.  \n\nDimensions & Placement :  \n  • The circle is perfectly round, with the figure sized to fit comfortably inside, leaving a consistent margin.  \n  • The stethoscope’s chest piece is positioned to the rig

## 🔍 Explore Field Extraction Outputs

Dive into the result to understand the contents and structure.

In [None]:
# Access one document from the results
doc = result_fe[0] # Choose index based on available docs

In [None]:
# Extract various outputs
markdown_output = doc.markdown
chunk_output = doc.chunks
doc_type = doc.doc_type
result_path = str(doc.result_path)

# Print metadata
print("Document Type:", doc_type)
print("Result Path:", result_path)
print("Markdown Summary (first 100 chars):")
print(markdown_output[:100])

# Access and iterate through chunks
print(f"Total Chunks: {len(doc.chunks)}")

for i, chunk in enumerate(doc.chunks):
    print(f"\n--- Chunk {i+1} ---")
    print("Chunk ID:", chunk.chunk_id)
    print("Chunk Type:", chunk.chunk_type.value)  # e.g., 'text', 'figure', etc.
    print("Text (shortened):", chunk.text[:100].replace("\n", " "), "...")

    # Access grounding (box and image path)
    for grounding in chunk.grounding:
        box = grounding.box
        print("  Page:", grounding.page)
        print(f"  Box (l, t, r, b): ({box.l:.3f}, {box.t:.3f}, {box.r:.3f}, {box.b:.3f})")
        print("  Image Path:", str(grounding.image_path))


Document Type: pdf
Result Path: None
Markdown Summary (first 100 chars):
Summary : This image is a logo representing a healthcare or medical professional, featuring a styliz
Total Chunks: 18

--- Chunk 1 ---
Chunk ID: 193e57e9-49a1-443a-b565-891487d13f0b
Chunk Type: figure
Text (shortened): Summary : This image is a logo representing a healthcare or medical professional, featuring a styliz ...
  Page: 0
  Box (l, t, r, b): (0.030, 0.028, 0.111, 0.087)
  Image Path: /content/groundings_folder/CBC-test-report-format-example-sample-template-Drlogy-lab-report_20250819_162513/page_0/ChunkType.figure_193e57e9-49a1-443a-b565-891487d13f0b_0.png

--- Chunk 2 ---
Chunk ID: d2ec2d63-f990-4180-8d4f-10fe81f586c2
Chunk Type: text
Text (shortened): DRLOGY PATHOLOGY LAB ...
  Page: 0
  Box (l, t, r, b): (0.128, 0.031, 0.649, 0.057)
  Image Path: /content/groundings_folder/CBC-test-report-format-example-sample-template-Drlogy-lab-report_20250819_162513/page_0/ChunkType.text_d2ec2d63-f990-4180-8d4f-10fe

In [None]:
# print the field extractions
doc.extraction

CBCLabReport(patient_name='Yash M. Patel', patient_age='21 Years', referring_doctor='Dr. Hiren Shah', sample_type='Blood', lab_name='DRLOGY PATHOLOGY LAB', pathologist_name='Dr. Payal Shah (MD, Pathologist)', hemoglobin_value=12.5, hemoglobin_status='Low', rbc_count_value=5.2, rbc_count_unit='mill/cumm')

In [None]:
# print the metadata for the field extractions
doc.extraction_metadata

CBCLabReportMetadata(patient_name=MetadataType[str](value='Yash M. Patel', chunk_references=['127bd433-1105-475e-ae80-5f46df017845'], confidence=0.9999613491459469), patient_age=MetadataType[str](value='21 Years', chunk_references=['127bd433-1105-475e-ae80-5f46df017845'], confidence=1.0), referring_doctor=MetadataType[str](value='Dr. Hiren Shah', chunk_references=['9d190409-a355-4c55-9abd-c787f417594e'], confidence=0.9999896741288625), sample_type=MetadataType[str](value='Blood', chunk_references=['03982220-216c-4862-86df-4d7f6a78e4d9'], confidence=None), lab_name=MetadataType[str](value='DRLOGY PATHOLOGY LAB', chunk_references=['d2ec2d63-f990-4180-8d4f-10fe81f586c2'], confidence=0.9954670761376193), pathologist_name=MetadataType[str](value='Dr. Payal Shah (MD, Pathologist)', chunk_references=['a6ade182-7e9e-4ff2-8c30-26a3b1b2f0fe'], confidence=None), hemoglobin_value=MetadataType[float](value=12.5, chunk_references=['03982220-216c-4862-86df-4d7f6a78e4d9'], confidence=None), hemoglobin
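In the metadata above, each field carries a `confidence` value, and for some fields it is `None`. One way to use this is to flag fields for manual review when the confidence is missing or falls under a chosen cutoff. The sketch below assumes only that each metadata entry exposes a `confidence` attribute, as the output shows. The function name, the mapping it takes, and the `0.99` default threshold are all hypothetical choices for illustration.

```python
from types import SimpleNamespace


def fields_needing_review(field_meta, threshold=0.99):
    """Return field names whose confidence is missing (None) or below `threshold`.

    `field_meta` maps a field name to an object with a `confidence` attribute.
    """
    flagged = []
    for name, meta in field_meta.items():
        confidence = getattr(meta, "confidence", None)
        if confidence is None or confidence < threshold:
            flagged.append(name)
    return flagged


# Stand-in objects mirroring three entries from the output above
sample = {
    "patient_age": SimpleNamespace(confidence=1.0),
    "lab_name": SimpleNamespace(confidence=0.9954670761376193),
    "sample_type": SimpleNamespace(confidence=None),
}
flagged = fields_needing_review(sample, threshold=0.999)
```

For a real document, the mapping could be built by reading each schema field name off `doc.extraction_metadata` with `getattr`.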

In [None]:
# print the extracted patient name
doc.extraction.patient_name

'Yash M. Patel'

In [None]:
# print the chunk from which the patient name is extracted
# note that there can be more than one, so this is returned as a list
doc.extraction_metadata.patient_name.chunk_references

['127bd433-1105-475e-ae80-5f46df017845']
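Because `chunk_references` is a list of chunk IDs, resolving several fields means looking up chunks by ID repeatedly. Building a dictionary keyed by `chunk_id` once avoids scanning `doc.chunks` each time. A minimal sketch, relying only on the `chunk_id` attribute used elsewhere in this notebook; `index_chunks` and `resolve_references` are hypothetical helper names.

```python
def index_chunks(chunks):
    """Map chunk_id -> chunk so lookups do not need a linear scan."""
    return {chunk.chunk_id: chunk for chunk in chunks}


def resolve_references(chunk_ids, chunk_index):
    """Return the chunks for `chunk_ids`, skipping IDs missing from the index."""
    return [chunk_index[cid] for cid in chunk_ids if cid in chunk_index]
```

For example, `resolve_references(doc.extraction_metadata.patient_name.chunk_references, index_chunks(doc.chunks))` would return the chunk objects behind the patient name.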

In [None]:
# print the page number and bounding box location for the chunk
target_id = '127bd433-1105-475e-ae80-5f46df017845'  # Update this value based on the response above

# Search through chunks to find the one with the matching ID
for chunk in doc.chunks:
    if chunk.chunk_id == target_id:
        print("Chunk type:", chunk.chunk_type)
        print("Chunk text:", chunk.text)
        for grounding in chunk.grounding:
            box = grounding.box
            print("Page:", grounding.page)
            print("Box Coordinates:")
            print(f"  Left (l):   {box.l}")
            print(f"  Top (t):    {box.t}")
            print(f"  Right (r):  {box.r}")
            print(f"  Bottom (b): {box.b}")
        break
else:
    print("Chunk ID not found.")

Chunk type: ChunkType.text
Chunk text: Yash M. Patel

Age : 21 Years
Sex : Male
PID : 555
Page: 0
Box Coordinates:
  Left (l):   0.031572625041007996
  Top (t):    0.14470963180065155
  Right (r):  0.20850571990013123
  Bottom (b): 0.22463364899158478
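The box values printed above all fall between 0 and 1, which suggests they are fractions of the page width and height rather than pixel positions. Under that assumption (worth confirming against the ADE documentation), a box can be scaled to pixel coordinates once the rendered page size is known. `box_to_pixels` is a hypothetical helper, and the page dimensions are whatever size the page image was rendered at.

```python
def box_to_pixels(l, t, r, b, page_width, page_height):
    """Scale a normalized (l, t, r, b) box to integer pixel coordinates.

    Assumes l and r are fractions of the page width, and t and b are
    fractions of the page height.
    """
    return (
        round(l * page_width),
        round(t * page_height),
        round(r * page_width),
        round(b * page_height),
    )
```

The returned tuple is ordered left, upper, right, lower, which is the order Pillow's `Image.crop` expects for its box argument.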


In [None]:
# print specific fields and the associated metadata for CBC lab report
print(f"The patient name is: {doc.extraction.patient_name}. This is extracted from chunk {doc.extraction_metadata.patient_name.chunk_references}")
print(f"The lab name is: {doc.extraction.lab_name}. This is extracted from chunk {doc.extraction_metadata.lab_name.chunk_references}")
print(f"The hemoglobin value is: {doc.extraction.hemoglobin_value}. This is extracted from chunk {doc.extraction_metadata.hemoglobin_value.chunk_references}")
print(f"The RBC count is: {doc.extraction.rbc_count_value} {doc.extraction.rbc_count_unit}. This is extracted from chunk {doc.extraction_metadata.rbc_count_value.chunk_references}")

The patient name is: Yash M. Patel. This is extracted from chunk ['127bd433-1105-475e-ae80-5f46df017845']
The lab name is: DRLOGY PATHOLOGY LAB. This is extracted from chunk ['d2ec2d63-f990-4180-8d4f-10fe81f586c2']
The hemoglobin value is: 12.5. This is extracted from chunk ['03982220-216c-4862-86df-4d7f6a78e4d9']
The RBC count is: 5.2 mill/cumm. This is extracted from chunk ['03982220-216c-4862-86df-4d7f6a78e4d9']


## 💾 Convert to Table and Save

Convert the field extractions to a pandas dataframe. Save it to the results folder created earlier.

In [None]:
import pandas as pd

# Assume result_fe is your list of ParsedDocument objects
# Example: result_fe = [ParsedDocument(...), ParsedDocument(...), ...]

# Note: With a single CBC document, this creates a DataFrame with one row containing all extracted fields.
# If you process multiple CBC documents, each document becomes a separate row in the DataFrame.

# Extract the CBC lab report data
records = []
for i, doc in enumerate(result_fe):
    body = doc.extraction
    meta = doc.extraction_metadata
    cbc_dict = {
        "document_name": file_names[i],
        "patient_name": body.patient_name,
        "patient_age": body.patient_age,
        "referring_doctor": body.referring_doctor,
        "sample_type": body.sample_type,
        "lab_name": body.lab_name,
        "pathologist_name": body.pathologist_name,
        "hemoglobin_value": body.hemoglobin_value,
        "hemoglobin_status": body.hemoglobin_status,
        "rbc_count_value": body.rbc_count_value,
        "rbc_count_unit": body.rbc_count_unit,
        "patient_name_ref": meta.patient_name.chunk_references,
        "patient_age_ref": meta.patient_age.chunk_references,
        "referring_doctor_ref": meta.referring_doctor.chunk_references,
        "sample_type_ref": meta.sample_type.chunk_references,
        "lab_name_ref": meta.lab_name.chunk_references,
        "pathologist_name_ref": meta.pathologist_name.chunk_references,
        "hemoglobin_value_ref": meta.hemoglobin_value.chunk_references,
        "hemoglobin_status_ref": meta.hemoglobin_status.chunk_references,
        "rbc_count_value_ref": meta.rbc_count_value.chunk_references,
        "rbc_count_unit_ref": meta.rbc_count_unit.chunk_references
    }
    records.append(cbc_dict)

# Create DataFrame
df = pd.DataFrame(records)
df

Unnamed: 0,document_name,patient_name,patient_age,referring_doctor,sample_type,lab_name,pathologist_name,hemoglobin_value,hemoglobin_status,rbc_count_value,...,patient_name_ref,patient_age_ref,referring_doctor_ref,sample_type_ref,lab_name_ref,pathologist_name_ref,hemoglobin_value_ref,hemoglobin_status_ref,rbc_count_value_ref,rbc_count_unit_ref
0,CBC-test-report-format-example-sample-template...,Yash M. Patel,21 Years,Dr. Hiren Shah,Blood,DRLOGY PATHOLOGY LAB,"Dr. Payal Shah (MD, Pathologist)",12.5,Low,5.2,...,[127bd433-1105-475e-ae80-5f46df017845],[127bd433-1105-475e-ae80-5f46df017845],[9d190409-a355-4c55-9abd-c787f417594e],[03982220-216c-4862-86df-4d7f6a78e4d9],[d2ec2d63-f990-4180-8d4f-10fe81f586c2],[a6ade182-7e9e-4ff2-8c30-26a3b1b2f0fe],[03982220-216c-4862-86df-4d7f6a78e4d9],[03982220-216c-4862-86df-4d7f6a78e4d9],[03982220-216c-4862-86df-4d7f6a78e4d9],[03982220-216c-4862-86df-4d7f6a78e4d9]


In [None]:
# Save the DataFrame to a CSV file inside the results_folder
csv_path = results_folder / "cbc_output.csv"
df.to_csv(csv_path, index=False)
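The `*_ref` columns hold Python lists, so the CSV stores their string representation (for example `['127bd433-...']`), which is awkward to parse back. If a plain delimited string is preferable, each record could be flattened before building the DataFrame. A minimal sketch; `flatten_list_values` and the `"; "` separator are hypothetical choices, not something the notebook or `agentic_doc` prescribes.

```python
def flatten_list_values(record, sep="; "):
    """Return a copy of `record` with list/tuple values joined into one string.

    Non-list values are passed through unchanged.
    """
    return {
        key: sep.join(str(item) for item in value)
        if isinstance(value, (list, tuple))
        else value
        for key, value in record.items()
    }
```

Applied to the records built earlier, `pd.DataFrame([flatten_list_values(r) for r in records])` would yield a frame whose reference columns are plain strings.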

## ✅ Wrap-Up

You’ve now used LandingAI’s ADE to:
- Parse and extract data from images or PDFs
- Define custom fields using `pydantic`
- Export structured results to a table

To learn more, visit the [LandingAI Documentation](https://docs.landing.ai/ade/ade-overview).