# ADE Lambda S3 - Serverless Document Processing

In [1]:
# ---
# LandingAI Applied AI Content Notebook Template
# ---
# Title:  ADE Lambda S3 - Serverless Document Processing
# Author: Ava Xia
# Description: Streamlined notebook for testing and using the deployed Lambda function
# Target Audience: [Developers, Partners, Customers]
# Content Type: [Tutorial, How-To]
# Publish Date: 2025-09-23
# ADE Version: v0.1.5
# Change Log:
#    - v1.0: Initial draft
#    - v1.1: Modularized with utility functions
#    - v1.2: Modularized with consolidated utility functions
# ---

## File Structure
- **`.env`** - Environment variables (API keys, AWS settings)
- **`config.py`** - Pydantic schemas and configuration management  
- **`utils.py`** - All utility functions consolidated

## 1️⃣ Environment Setup & Configuration

### 🔑 Prerequisites

Before running this notebook:
1. **Copy `.env.example` to `.env`** and fill in your values:
   ```bash
   cp .env.example .env
   # Edit .env with your API keys and AWS settings
   ```

2. **Log in to AWS SSO** (if using SSO):
   ```bash
   aws configure sso
   aws sso login --profile your-profile-name
   ```
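
Before running the setup cell, it can help to confirm that `.env` actually contains the values the notebook needs. The following is a minimal sketch using only the standard library. The key names in `REQUIRED_KEYS` are hypothetical placeholders, not the names used by this project; replace them with the keys listed in your `.env.example`.

```python
from pathlib import Path

# Hypothetical key names for illustration only; use the keys from .env.example.
REQUIRED_KEYS = ("LANDINGAI_API_KEY", "AWS_PROFILE", "AWS_REGION", "S3_BUCKET")


def parse_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]
```

Calling `missing_keys(parse_env_file(".env"))` returns an empty list when every required key has a value. This parser handles plain `KEY=VALUE` lines only; it does not expand variables or handle multi-line values.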

In [2]:
# Import consolidated utilities and configuration
import os
import json
from datetime import datetime
from pathlib import Path
import pandas as pd
from IPython.display import display, JSON

# Import consolidated modules
from config import get_settings, InvoiceExtractionSchema
from utils import (
    setup_aws_environment,
    list_s3_files,
    check_lambda_environment,
    get_lambda_metrics,
    process_single_file,
    process_batch_extraction,
    display_parsing_result,
    display_extraction_result,
    display_batch_dataframe
)

# Initialize environment using config.py and .env
print("="*60)
print("🔧 Initializing AWS environment...")
print("="*60)

# Load configuration (automatically reads from .env)
config, clients, AWS_ACCOUNT_ID, aws_session = setup_aws_environment()

# Check if credentials are valid
if AWS_ACCOUNT_ID in ['EXPIRED', 'ERROR']:
    print("\n⚠️  Please refresh your AWS credentials with `aws sso login` (see Prerequisites above)")
else:
    # Extract configuration values
    BUCKET_NAME = config['bucket_name']
    FUNCTION_NAME = config['function_name']
    ECR_REPO = config['ecr_repo']
    AWS_REGION = config['aws_region']
    
    print("\n✅ Environment ready!")
    print(f"   Lambda: {FUNCTION_NAME}")
    print(f"   Bucket: {BUCKET_NAME}")
    print(f"   Region: {AWS_REGION}")
    print("="*60)

🔧 Initializing AWS environment...
✅ AWS Environment configured
   Profile: workload-dev-2
   Region: us-east-2
   Account: 9700XXXX1993

✅ Environment ready!
   Lambda: ade-lambda-s3
   Bucket: cf-mle-testing
   Region: us-east-2
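
The setup cell treats `'EXPIRED'` and `'ERROR'` as sentinel values for the account ID. The implementation of `setup_aws_environment` lives in `utils.py` and is not shown here, so the following is only a sketch of how such a check could work, assuming an STS client is passed in (for example `boto3.client("sts")`). The function names and the rule for telling an expired token from other errors are assumptions.

```python
def resolve_account_id(sts_client):
    """Return the AWS account ID, or 'EXPIRED' / 'ERROR' if the call fails."""
    try:
        # get_caller_identity is the standard STS call for "who am I?".
        return sts_client.get_caller_identity()["Account"]
    except Exception as exc:
        # Assumption: expired SSO or session tokens mention "expired" in the
        # error text; anything else is reported as a generic error.
        if "expired" in str(exc).lower():
            return "EXPIRED"
        return "ERROR"


def mask_account_id(account_id):
    """Keep the first and last four characters and mask the rest with X."""
    if len(account_id) <= 8:
        return account_id
    return account_id[:4] + "X" * (len(account_id) - 8) + account_id[-4:]
```

Masking a 12-digit account ID this way gives the same shape as the `Account:` line in the output above: four digits, four `X` characters, four digits.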


## 2️⃣ Docker Build & ECR Deployment

### Understanding the Architecture
See Readme.md for a step-by-step guide to building the Docker image, pushing it to ECR, and deploying it to Lambda.

```
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│   Docker    │ ---> │     ECR     │ ---> │   Lambda    │
│   Image     │      │ Repository  │      │  Function   │
└─────────────┘      └─────────────┘      └─────────────┘
     Build              Push                  Deploy
```
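
To make the build, push, and deploy flow concrete, here is a sketch that assembles the ECR image URI and the command sequence such a pipeline typically runs. It only builds strings and executes nothing. The project's real `build.sh` and `deploy.sh` may differ; the `latest` tag, the `linux/arm64` platform, and the function names are assumptions for illustration.

```python
def ecr_image_uri(account_id, region, repo, tag="latest"):
    """Build the standard private ECR image URI."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"


def deployment_commands(account_id, region, repo, function_name, tag="latest"):
    """Return the shell commands for build -> push -> deploy, in order."""
    uri = ecr_image_uri(account_id, region, repo, tag)
    registry = uri.split("/", 1)[0]
    return [
        # Build: assumes an arm64 Lambda; adjust the platform if yours differs.
        f"docker build --platform linux/arm64 -t {repo}:{tag} .",
        # Push: authenticate Docker against ECR, then tag and push the image.
        f"aws ecr get-login-password --region {region} | "
        f"docker login --username AWS --password-stdin {registry}",
        f"docker tag {repo}:{tag} {uri}",
        f"docker push {uri}",
        # Deploy: point the existing Lambda function at the new image.
        f"aws lambda update-function-code --function-name {function_name} "
        f"--image-uri {uri} --region {region}",
    ]
```

With the variables from the setup cell, `deployment_commands(AWS_ACCOUNT_ID, AWS_REGION, ECR_REPO, FUNCTION_NAME)` prints a checklist you can compare against the scripts described in Readme.md.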



## 3️⃣ Verify Lambda Deployment Status

In [3]:

env_status = check_lambda_environment(clients['lambda'], FUNCTION_NAME)

if env_status.get('configured'):
    print("\n✅ Lambda is deployed and configured!")
    
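    # Get Lambda metrics
    metrics = get_lambda_metrics(clients['lambda'], FUNCTION_NAME)
    # Assumption: metrics may use boto3's configuration keys ('MemorySize',
    # 'Architectures'); try those first, then fall back to the original keys.
    memory = metrics.get('MemorySize', metrics.get('Memory', 'Unknown'))
    architecture = metrics.get('Architectures', metrics.get('Architecture', 'Unknown'))
    print("\n📊 Function Details:")
    print(f"   State: {metrics.get('State', 'Unknown')}")
    print(f"   Memory: {memory}")
    print(f"   Timeout: {metrics.get('Timeout', 'Unknown')}")
    print(f"   Architecture: {architecture}")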
else:
    print("\n⚠️  Lambda not fully configured.")
    print("   Please run ./build.sh and ./deploy.sh first.")

🔐 Lambda Environment Configuration
Environment Variables:
   ✅ 🔑 LandingAI API Key: bDk2****TlFD
   ✅ 🪣 S3 Bucket: cf-mle-testing
   ℹ️  🌍 AWS Region: Using default

✅ Lambda is deployed and configured!
📊 Lambda Function Metrics
   Function: ade-lambda-s3
   State: Active
   Memory: 1024 MB
   Timeout: 300 seconds
   Architecture: arm64
   Package Type: Image
   Last Modified: 2025-09-25T20:44:05.000+0000

📊 Function Details:
   State: Active
   Memory: Unknown
   Timeout: 300
   Architecture: Unknown
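
For reference, the fields shown above are all available from the Lambda `GetFunctionConfiguration` API. The sketch below shows one way to read them with a boto3 Lambda client passed in as `lambda_client`. It is not the code in `utils.py`; the function name and the output labels are assumptions.

```python
def summarize_lambda_config(lambda_client, function_name):
    """Fetch a function's configuration and return display-ready fields."""
    cfg = lambda_client.get_function_configuration(FunctionName=function_name)
    return {
        "State": cfg.get("State", "Unknown"),
        # boto3 reports memory as 'MemorySize' (MB) and timeout in seconds.
        "Memory": f"{cfg['MemorySize']} MB" if "MemorySize" in cfg else "Unknown",
        "Timeout": f"{cfg['Timeout']} seconds" if "Timeout" in cfg else "Unknown",
        # 'Architectures' is a list, for example ['arm64'].
        "Architecture": ", ".join(cfg.get("Architectures", [])) or "Unknown",
        "Package Type": cfg.get("PackageType", "Unknown"),
        "Last Modified": cfg.get("LastModified", "Unknown"),
    }
```

Note that the response uses the key names `MemorySize` and `Architectures`, so looking up `Memory` or `Architecture` on the raw response would fall through to a default.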


## 4️⃣ Browse Available S3 Files

View documents available for processing:

In [4]:
# List files in S3 bucket
files = list_s3_files(clients['s3'], BUCKET_NAME, "invoices/", max_files=20)

if files:
    # Create DataFrame for better visualization
    df_files = pd.DataFrame(files)
    print("\n📋 Available Files:")
    display(df_files.head(10))
    if len(files) > 10:
        print(f"\n... and {len(files)-10} more files")
else:
    print("📂 No files found in invoices/ folder")
    print("   Upload some PDF files to process")

📂 Files in s3://cf-mle-testing/invoices/
Found 20 files

📋 Available Files:


| | File | Size | Modified |
|---|---|---|---|
| 0 | invoices/invoice_1.pdf | 381.5 KB | 2025-09-23 08:03 |
| 1 | invoices/invoice_10.pdf | 122.6 KB | 2025-09-25 22:39 |
| 2 | invoices/invoice_11.pdf | 116.5 KB | 2025-09-25 22:39 |
| 3 | invoices/invoice_12.pdf | 57.6 KB | 2025-09-25 22:39 |
| 4 | invoices/invoice_13.pdf | 205.2 KB | 2025-09-25 22:39 |
| 5 | invoices/invoice_14.pdf | 221.3 KB | 2025-09-25 22:39 |
| 6 | invoices/invoice_15.pdf | 210.8 KB | 2025-09-25 22:39 |
| 7 | invoices/invoice_16.pdf | 149.9 KB | 2025-09-25 22:39 |
| 8 | invoices/invoice_17.pdf | 151.1 KB | 2025-09-25 22:39 |
| 9 | invoices/invoice_18.pdf | 43.7 KB | 2025-09-25 22:39 |



... and 10 more files
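
A listing like the one above can be produced with the S3 `list_objects_v2` paginator. The sketch below assumes a boto3 S3 client is passed in as `s3_client`; the function name, the PDF-only filter, and the row format are assumptions and may not match `list_s3_files` in `utils.py`.

```python
def list_pdf_objects(s3_client, bucket, prefix, max_files=20):
    """Return up to max_files PDF objects under a prefix as display rows."""
    rows = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # 'Contents' is absent when a page has no matching objects.
        for obj in page.get("Contents", []):
            if not obj["Key"].lower().endswith(".pdf"):
                continue
            rows.append({
                "File": obj["Key"],
                "Size": f"{obj['Size'] / 1024:.1f} KB",
                "Modified": obj["LastModified"].strftime("%Y-%m-%d %H:%M"),
            })
            if len(rows) >= max_files:
                return rows
    return rows
```

S3 returns keys in lexicographic order, which is why `invoice_10.pdf` sorts directly after `invoice_1.pdf` in the table above. The result can be passed straight to `pd.DataFrame(rows)`.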


## 5️⃣ Single Document Processing

Process a single document in one of two modes:
- **Parsing Mode**: Parse the entire document into chunks (text, table, figure)
- **Extraction Mode**: Extract structured data based on a schema

### Option 1: Parsing Mode

In [5]:
# Process a single file with parsing mode
test_file = "invoices/invoice_4.pdf"

print(f'Parsing {test_file}')

# Process with parsing mode
result = process_single_file(
    clients['lambda'], 
    FUNCTION_NAME, 
    BUCKET_NAME, 
    test_file,
    extraction=False  # Parsing mode
)

# Display results using utility function with S3 client
display_parsing_result(result, test_file, s3_client=clients['s3'])

Parsing invoices/invoice_4.pdf
📄 Parsing document: invoices/invoice_4.pdf
Mode: Parsing (document structure)
Returns: List of chunks (text, table, figure types)

✅ Parsing successful!
   Results saved to: s3://cf-mle-testing/ade-results/invoice_4_parsed_20250925_224010.json

📄 Raw Parsed Output:


<IPython.core.display.JSON object>
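
Under the hood, a helper like `process_single_file` has to invoke the Lambda function synchronously and decode its response. The sketch below shows that pattern with the boto3 `invoke` call, assuming a Lambda client is passed in as `lambda_client`. The event field names (`bucket_name`, `prefix`, `extraction`, `document_type`) are hypothetical; the deployed handler defines the real ones.

```python
import json


def invoke_document_lambda(lambda_client, function_name, bucket, key,
                           extraction=False, document_type=None):
    """Invoke the function synchronously and return its decoded JSON response."""
    # Hypothetical event shape; check the Lambda handler for the real fields.
    event = {"bucket_name": bucket, "prefix": key, "extraction": extraction}
    if document_type:
        event["document_type"] = document_type

    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",  # wait for the result
        Payload=json.dumps(event).encode("utf-8"),
    )
    body = json.loads(response["Payload"].read())
    # 'FunctionError' is present only when the function itself failed.
    if response.get("FunctionError"):
        raise RuntimeError(f"Lambda reported an error: {body}")
    return body
```

For long-running synchronous calls, the client's read timeout must exceed the function's run time. botocore's default read timeout is 60 seconds, so a client built with a larger `read_timeout` in `botocore.config.Config` is usually needed for invocations that run for minutes.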

### Option 2: Extraction Mode (Structured Data)

In [6]:
# Process a single file with extraction mode
test_file = "invoices/invoice_3.pdf"

print(f'Extracting {test_file}')

# Process with extraction mode
result = process_single_file(
    clients['lambda'], 
    FUNCTION_NAME, 
    BUCKET_NAME, 
    test_file,
    document_type="invoice",  # Specify document type for extraction
    extraction=True  # Extraction mode
)

# Display results using utility function with S3 client
display_extraction_result(result, test_file, document_type="invoice", s3_client=clients['s3'])

Extracting invoices/invoice_3.pdf
📄 Extracting structured data from: invoices/invoice_3.pdf
Mode: Extraction (structured data)
Schema: InvoiceExtractionSchema

✅ Extraction successful!
   Results saved to: s3://cf-mle-testing/ade-results/invoice_3_extracted_20250925_224050.json

📊 Extracted Data (JSON format):
------------------------------------------------------------


<IPython.core.display.JSON object>
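
Both modes report a `Results saved to: s3://...` location. To inspect a saved result outside the display helpers, the URI can be split into bucket and key and the JSON fetched with `get_object`. This is a sketch assuming a boto3 S3 client is passed in as `s3_client`; the function names are illustrative and not part of `utils.py`.

```python
import json
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/path/to/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"Not an s3://bucket/key URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def load_result_json(s3_client, uri):
    """Download a JSON object from S3 and return it as Python data."""
    bucket, key = split_s3_uri(uri)
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return json.loads(response["Body"].read())
```

For example, `load_result_json(clients['s3'], uri)` with the URI printed in the output above returns the stored result as a dictionary or list, depending on how the Lambda wrote it.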

## 6️⃣ Batch Processing

In [7]:
# Process batch and get DataFrame
df_invoices = process_batch_extraction(
    clients['lambda'],
    clients['s3'],
    FUNCTION_NAME,
    BUCKET_NAME,
    "invoices/",
    document_type="invoice",
    extraction=True,
    session=aws_session
)

# Display the DataFrame using utility function
csv_file = display_batch_dataframe(df_invoices, export_csv=True)

📋 Batch Invoice Extraction Test
   Found 26 PDF files to process
   ⏱️  Estimated time: 4-5 minutes

🚀 Invoking Lambda for batch processing...
⏱️  Lambda returned after 203.5 seconds                                         

✅ Batch processing successful!
   Documents processed: 26
   Average time per document: 7.8s
   Results location: s3://cf-mle-testing/ade-results/batch_extracted_20250925_224415.json

📥 Downloading results from S3...
   Processing 26 documents...

📊 Extracted Data as DataFrame:
------------------------------------------------------------


| | File | Invoice # | Date | Customer | Supplier | Subtotal | Tax | Total | Currency | Line Items | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | invoice_1.pdf | INV33543191 | 2020-07-29 | Abaxys Tech LLC | Zoom Video Communications Inc. | $149.90 | - | - | USD | 1 | PAID |
| 1 | invoice_10.pdf | 1000110140 | 2025-05-15 | ANDREA KROPP | Sheraton Tucson Hotel & Suites | - | - | - | USD | 10 | |
| 2 | invoice_11.pdf | 2071221 | 2021-08-30 | Souhail Martesse | DollarFulfillment | - | - | $1,800.87 | USD | 1 | |
| 3 | invoice_12.pdf | 11828454 | | ANDREA KROPP | Condor Flugdienst GmbH | - | - | $2,579.96 | USD | 7 | |
| 4 | invoice_13.pdf | 812 | 2021-12-02 | SAGAR ASIA PRIVATE LIMITED | KANDHAN METAL COMPANY | $5,102,920.00 | $918,525.60 | $6,021,446.00 | INR | 1 | |
| 5 | invoice_14.pdf | 40458946 | 2019-02-23 | Gnr-Grupo Novo Rock, Lda | Thomann GmbH | - | - | $77.24 | EUR | 5 | |
| 6 | invoice_15.pdf | 0000329003 | 2019-04-04 | Nazish | Jade E-Services Pakistan Private Limited | $147.00 | - | $147.00 | PKR | 1 | |
| 7 | invoice_16.pdf | 1 | 2023-03-20 | Mansoer Walizada | Walmart | $1,529.94 | $110.92 | $1,640.86 | USD | 1 | PAID |
| 8 | invoice_17.pdf | 2014/00355 | 2014-12-10 | Sandip Patil | Variant Technologies | $2,800.00 | - | $2,800.00 | | 1 | |
| 9 | invoice_18.pdf | 00000116271 | 2020-02-10 | Meridian Venture Services, LLC | Howard Custom Transfers, Inc. | $270.00 | - | - | USD | 2 | PAID |



📈 Summary Statistics:
   Total records: 26
   Total value: $6,095,824.76
   Unique customers: 20
   Unique suppliers: 26

💾 Results exported to: /Users/avaxia/landingAI/ade-lambda-s3/Workflows/ADE_Lambda_S3/output_folder/extraction_results_20250925_154416.csv
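
The batch above contains invoices in several currencies (USD, INR, EUR, and PKR appear in the rows shown), so a single grand total adds unlike units. A per-currency breakdown is often more useful. The sketch below assumes each row is a dictionary with `Total` and `Currency` entries formatted as in the table, with `-` or an empty value meaning "not extracted"; the function names are illustrative.

```python
from collections import defaultdict
from decimal import Decimal


def parse_amount(text):
    """Convert '$1,800.87' to Decimal; return None for '-' or empty cells."""
    cleaned = (text or "").replace("$", "").replace(",", "").strip()
    if cleaned in ("", "-"):
        return None
    return Decimal(cleaned)


def totals_by_currency(rows):
    """Sum the 'Total' column separately for each 'Currency' value."""
    totals = defaultdict(Decimal)  # Decimal() starts each currency at 0
    for row in rows:
        amount = parse_amount(row.get("Total"))
        if amount is not None:
            # Rows without a currency are grouped under 'UNKNOWN'.
            totals[row.get("Currency") or "UNKNOWN"] += amount
    return dict(totals)
```

With the DataFrame from the batch cell, `totals_by_currency(df_invoices.to_dict("records"))` gives one total per currency, provided the columns hold formatted strings like those displayed above.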
