<div align="center">
    <img src="Logo.png" width="1000">
</div>

# Course 5: Agentic Document Extraction from LandingAI on AWS

---

## Part 0: Configuration Setup

- **Why:** Native dependencies must be built for your Lambda's **Python version** and **CPU architecture**; otherwise imports may fail at runtime.
- **How to use:** Run the cell below locally *and* in Lambda (read the output in CloudWatch Logs). Confirm that both **Major.Minor** and **Architecture** match your build environment.

In [1]:
import sys, platform

print("Full Python version:", sys.version)
print("Major.Minor:", f"{sys.version_info.major}.{sys.version_info.minor}")
print("Architecture:", platform.machine())

Full Python version: 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0]
Major.Minor: 3.10
Architecture: x86_64
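
The comparison can also be automated before packaging. This is a minimal sketch, assuming the target is Python 3.10 on `x86_64` (the values printed above); `runtime_mismatches` is a hypothetical helper and is not part of `lambda_helpers`.

```python
import platform
import sys


def runtime_mismatches(expected_version: str, expected_arch: str) -> list[str]:
    """Return human-readable differences between this interpreter and the target."""
    actual_version = f"{sys.version_info.major}.{sys.version_info.minor}"
    actual_arch = platform.machine()
    problems = []
    if actual_version != expected_version:
        problems.append(f"Python {actual_version} != expected {expected_version}")
    if actual_arch != expected_arch:
        problems.append(f"Architecture {actual_arch} != expected {expected_arch}")
    return problems


# Assumed target values; adjust them to the runtime you actually deploy to.
print(runtime_mismatches("3.10", "x86_64") or "Build environment matches the target")
```

An empty list means the local build environment matches the assumed target.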


In [2]:
# Install required packages
!pip install --quiet boto3 python-dotenv

### Part 0 - Step 1: Environment Setup

In [1]:
import boto3, os, json
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

True

### Part 0 - Step 2: Initialize the AWS Client & Import the Helper Functions

In [2]:
# Remove AWS_PROFILE so a named profile cannot interfere with the explicit credentials below
if os.getenv("AWS_PROFILE"):
    del os.environ["AWS_PROFILE"]

session = boto3.Session(
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
    region_name=os.getenv("AWS_REGION"),
)

# Instantiate the AWS Clients
s3_client = session.client("s3")
lambda_client = session.client("lambda")
iam = session.client("iam")  # Add IAM client for Lambda role management
logs = session.client("logs")  # CloudWatch Logs client for monitoring
bedrock_agent_runtime = session.client("bedrock-agent-runtime")
bedrock_runtime = session.client("bedrock-runtime")

# Test Connection
sts = session.client("sts")
print(json.dumps(sts.get_caller_identity(), indent=2))

{
  "UserId": "AIDA4YUVVZFMQPBTTFLFI",
  "Account": "877560973657",
  "Arn": "arn:aws:iam::877560973657:user/dlai_ade",
  "ResponseMetadata": {
    "RequestId": "3e855d96-4fc9-4d30-83d0-b9a98785db6a",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "x-amzn-requestid": "3e855d96-4fc9-4d30-83d0-b9a98785db6a",
      "x-amz-sts-extended-request-id": "MTp1cy13ZXN0LTI6UzoxNzYzNzc0NzE4MjAzOlI6dFlWOGFESmM=",
      "content-type": "text/xml",
      "content-length": "405",
      "date": "Sat, 22 Nov 2025 01:25:18 GMT"
    },
    "RetryAttempts": 0
  }
}


In [20]:
# Import Helper Functions
import pandas as pd
from lambda_helpers import *

print("Helper functions loaded")

Helper functions loaded


---

## Part 2: Document Parsing and Medical Chatbot with Memory

### Part 2 - Step 1: Create a Deployment Package

In [6]:
source_files = ["ade_s3_handler.py"]
requirements = ["landingai-ade", "typing-extensions"]

zip_path = create_deployment_package(
    source_files=source_files,
    requirements=requirements,
    output_zip="ade_lambda.zip",
    package_dir="ade_package"
)

📦 Creating deployment package: ade_lambda.zip
   Installing dependencies: landingai-ade, typing-extensions
   Adding source: ade_s3_handler.py
   Creating zip archive...
✅ Package created: ade_lambda.zip (4.4 MB)


### Part 2 - Step 2: Create or Get the IAM Role

In [7]:
role_arn = create_or_update_lambda_role(
  iam_client=iam,
  role_name="lambda-ade-exec-role",
  description="Execution role for LandingAI ADE Lambda"
)

ℹ️ Using existing role: lambda-ade-exec-role


### Part 2 - Step 3: Deploy the Lambda function with Skip-Existing Feature

In [8]:
env_vars = {
   "VISION_AGENT_API_KEY": os.getenv("VISION_AGENT_API_KEY"),
   "ADE_MODEL": "dpt-2-latest",
   "INPUT_FOLDER": "input/",
   "OUTPUT_FOLDER": "output/",
   "S3_BUCKET": os.getenv("S3_BUCKET"),
   "FORCE_REPROCESS": "false"  # Set to "true" to reprocess all files even if outputs exist
}

response = deploy_lambda_function(
    lambda_client=lambda_client,
    function_name="ade-s3-handler",
    zip_file="ade_lambda.zip",
    role_arn=role_arn,
    handler="ade_s3_handler.ade_handler",
    env_vars=env_vars,
    runtime="python3.10",
    timeout=900,
    memory_size=1024
)

🚀 Deploying Lambda function: ade-s3-handler
ℹ️ Function exists, updating...
   Code updated, waiting for deployment...
✅ Lambda function updated: ade-s3-handler
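
How the handler applies `FORCE_REPROCESS` is not shown in this notebook. The sketch below is one plausible shape for the skip-existing decision, under that assumption; `force_reprocess_enabled` and `should_process` are hypothetical names. Lambda environment variables are always strings, so the flag has to be compared as text: `bool("false")` would evaluate to `True`.

```python
import os


def force_reprocess_enabled(env=None):
    """Interpret FORCE_REPROCESS as a boolean; only the text "true" enables it."""
    env = os.environ if env is None else env
    return env.get("FORCE_REPROCESS", "false").strip().lower() == "true"


def should_process(output_key, existing_keys, force):
    """Process a document when forced, or when its output does not exist yet."""
    return force or output_key not in existing_keys


# Illustrative data: one output already exists in the bucket.
existing = {"output/medical/CT_Study_of_the_Common_Cold.md"}
force = force_reprocess_enabled({"FORCE_REPROCESS": "false"})

print(should_process("output/medical/CT_Study_of_the_Common_Cold.md", existing, force))
print(should_process("output/medical/new_document.md", existing, force))
```

With the flag off, the first document is skipped and the second is processed.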


### Part 2 - Step 4: Set Up the S3 Trigger

In [9]:
# Trigger on all files in input/ folder
setup_s3_trigger(
  s3_client=s3_client,
  lambda_client=lambda_client,
  bucket=os.getenv("S3_BUCKET"),
  prefix="input/",
  function_name="ade-s3-handler",
  suffix=None  # Optional: set to ".pdf" to only trigger on PDF files
)

⚙️ Setting up S3 trigger: s3://universal-docs-877560973657/input/ → ade-s3-handler
   ℹ️ Permission may already exist: An error occurred (ResourceConflictException) when calling the AddPermission operation: The statement id (s3invokepermission) provided already exists. Please provide a new statement id, or remove the existing statement.
✅ S3 trigger set for s3://universal-docs-877560973657/input/ → ade-s3-handler
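
`setup_s3_trigger` lives in `lambda_helpers`, so its body is not visible here. The log above shows that it calls `AddPermission` with the statement ID `s3invokepermission`. The sketch below shows the kind of notification configuration such a helper would typically pass to boto3's `put_bucket_notification_configuration`; the function name and the ARN are illustrative placeholders, not values from this course.

```python
import json


def build_s3_lambda_notification(function_arn, prefix, suffix=None):
    """Build an S3 notification config that invokes a Lambda on object creation."""
    rules = [{"Name": "prefix", "Value": prefix}]
    if suffix:
        # Optional filter, e.g. ".pdf", to ignore other file types.
        rules.append({"Name": "suffix", "Value": suffix})
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": function_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {"Key": {"FilterRules": rules}},
            }
        ]
    }


# Placeholder ARN for illustration only.
config = build_s3_lambda_notification(
    "arn:aws:lambda:us-west-2:123456789012:function:ade-s3-handler",
    prefix="input/",
    suffix=".pdf",
)
print(json.dumps(config, indent=2))
```

Applying it would be a call such as `s3_client.put_bucket_notification_configuration(Bucket=bucket, NotificationConfiguration=config)`, made after S3 has been granted permission to invoke the function. That call replaces the bucket's entire notification configuration, so any existing triggers need to be included in the same request.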


### Part 2 - Step 5: Parse the Documents Using LandingAI's Agentic Document Extraction & Monitor Lambda

In [10]:
# Upload medical documents to S3 input folder
local_folder = "medical/" 

# Check if folder exists and upload
if os.path.exists(local_folder):
    count = upload_folder_to_s3(
        s3_client=s3_client,
        local_folder = local_folder,
        s3_prefix=f"input/{local_folder}",
        bucket=os.getenv("S3_BUCKET"),
        file_extensions=[".pdf", ".PDF"]
    )
    print(f"\n⏳ Waiting for automatic parsing to complete...")
    print("   (The existing Lambda will automatically convert PDFs to markdown)")
else:
    print(f"⚠️ Folder not found: {local_folder}")

📤 Uploading medical/ → s3://universal-docs-877560973657/input/medical/
   (Skipping files that already exist in S3)
   ⬆️ Uploading: Common_cold_clinincal_evidence.pdf
   ⬆️ Uploading: CT_Study_of_the_Common_Cold.pdf
   ⬆️ Uploading: Prevention_and_treatment_of_the_common_cold.pdf
   ⬆️ Uploading: Vitamin_C_for_Preventing_and_Treating_the_Common_Cold.pdf
   ⬆️ Uploading: Evaluation_of_echinacea_for_the_prevention_and_treatment_of_the_common_cold.pdf
   ⬆️ Uploading: Understanding_the_symptoms_of_the_common_cold_and_influenza.pdf
   ⬆️ Uploading: Viruses_and_Bacteria_in_the_Etiology_of_the_Common_Cold.pdf
   ⬆️ Uploading: The_common_cold_a_review_of_the_literature.pdf
✅ Uploaded 8 files

⏳ Waiting for automatic parsing to complete...
   (The existing Lambda will automatically convert PDFs to markdown)


In [11]:
# Monitoring Lambda
stats = monitor_lambda_processing(
  logs_client=logs,
  s3_client=s3_client,
  bucket_name=os.getenv("S3_BUCKET")
)
# To stop monitoring in Jupyter, press Esc and then press I twice (this interrupts the kernel)

⏳ Monitoring Lambda processing...
   Press Ctrl+C to stop monitoring

✅ Processed: Evaluation_of_echinacea_for_the_prevention_and_treatment_of_the_common_cold.pdf
✅ Processed: Viruses_and_Bacteria_in_the_Etiology_of_the_Common_Cold.pdf
✅ Processed: Prevention_and_treatment_of_the_common_cold.pdf
✅ Processed: CT_Study_of_the_Common_Cold.pdf
✅ Processed: The_common_cold_a_review_of_the_literature.pdf
✅ Processed: Common_cold_clinincal_evidence.pdf
✅ Processed: Understanding_the_symptoms_of_the_common_cold_and_influenza.pdf

⛔ Monitoring stopped by user

📊 Lambda Processing Summary:
   Processed: 7 files
   Skipped: 0 files
   Errors: 0 files

   Files processed in this session:
   - CT_Study_of_the_Common_Cold.pdf
   - Common_cold_clinincal_evidence.pdf
   - Evaluation_of_echinacea_for_the_prevention_and_treatment_of_the_common_cold.pdf
   - Prevention_and_treatment_of_the_common_cold.pdf
   - The_common_cold_a_review_of_the_literature.pdf
   - Understanding_the_symp


   Show all output files? (y/n):  y



   All output files:
   - output/Common_cold_clinincal_evidence.pdf.md
   - output/medical/CT_Study_of_the_Common_Cold.md
   - output/medical/Common_cold_clinincal_evidence.md
   - output/medical/Evaluation_of_echinacea_for_the_prevention_and_treatment_of_the_common_cold.md
   - output/medical/Prevention_and_treatment_of_the_common_cold.md
   - output/medical/The_common_cold_a_review_of_the_literature.md
   - output/medical/Understanding_the_symptoms_of_the_common_cold_and_influenza.md
   - output/medical/Viruses_and_Bacteria_in_the_Etiology_of_the_Common_Cold.md
   - output/medical/Vitamin_C_for_Preventing_and_Treating_the_Common_Cold.md
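
Most entries in the listing mirror the input path, with `input/` swapped for `output/` and the extension changed to `.md`. The sketch below reproduces that mapping. It is inferred from the listing only, not taken from `ade_s3_handler.py`, and `output_key_for` is a hypothetical name. The first entry (`output/Common_cold_clinincal_evidence.pdf.md`) follows a different pattern and is not covered.

```python
from pathlib import PurePosixPath


def output_key_for(input_key, input_prefix="input/", output_prefix="output/"):
    """Map an uploaded PDF key to the markdown key seen in the listing above."""
    if not input_key.startswith(input_prefix):
        raise ValueError(f"{input_key!r} is not under {input_prefix!r}")
    # Keep the sub-folder structure and replace the final extension with .md.
    relative = PurePosixPath(input_key[len(input_prefix):])
    return output_prefix + str(relative.with_suffix(".md"))


print(output_key_for("input/medical/CT_Study_of_the_Common_Cold.pdf"))
```

For the key above this prints `output/medical/CT_Study_of_the_Common_Cold.md`, which matches the second entry in the listing.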


### Part 2 - Step 6: Set up the Knowledge Base on AWS Bedrock

In [12]:
# List all your knowledge bases
bedrock_agent = session.client("bedrock-agent")

print("📋 All Knowledge Bases in your account:")
kb_response = bedrock_agent.list_knowledge_bases()

for kb in kb_response.get("knowledgeBaseSummaries", []):
  print(f"\n✨ Knowledge Base: {kb['name']}")
  print(f"   ID: {kb['knowledgeBaseId']}")
  print(f"   Status: {kb['status']}")
  print(f"   Updated: {kb['updatedAt']}")

  # Get data sources for this knowledge base
  ds_response = bedrock_agent.list_data_sources(
      knowledgeBaseId=kb['knowledgeBaseId']
  )

  for ds in ds_response.get("dataSourceSummaries", []):
      print(f"   📁 Data Source: {ds['name']}")
      print(f"      ID: {ds['dataSourceId']}")
      print(f"      Status: {ds['status']}")

📋 All Knowledge Bases in your account:

✨ Knowledge Base: medical_chatbot
   ID: MY5NGNIDC9
   Status: ACTIVE
   Updated: 2025-11-21 23:29:36.970060+00:00
   📁 Data Source: medical
      ID: WCFICXNJXJ
      Status: AVAILABLE


### Part 2 - Step 7: Ingest the Parsed Outputs into the New Knowledge Base

In [13]:
BEDROCK_KB_ID = "MY5NGNIDC9"
DATA_SOURCE_ID = "WCFICXNJXJ"

In [14]:
response = bedrock_agent.start_ingestion_job(
    knowledgeBaseId=BEDROCK_KB_ID,
    dataSourceId=DATA_SOURCE_ID
)

job_id = response.get("ingestionJob", {}).get("ingestionJobId")
print("✅ Knowledge base sync initiated.")
print(f"   - JobId:           {job_id}")

✅ Knowledge base sync initiated.
   - JobId:           UEZTI1GHKW
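
`start_ingestion_job` only starts the sync; the documents are not searchable until the job finishes. Below is a minimal polling sketch built on the `get_ingestion_job` call of the `bedrock-agent` client. `wait_for_ingestion` and `FakeBedrockAgent` are hypothetical names; the fake client exists only so the example runs without AWS access.

```python
import time

# Statuses after which the job will not change any further.
TERMINAL_STATUSES = {"COMPLETE", "FAILED", "STOPPED"}


def wait_for_ingestion(client, kb_id, data_source_id, job_id,
                       poll_seconds=10, timeout_seconds=900):
    """Poll an ingestion job until it reaches a terminal status or times out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = client.get_ingestion_job(
            knowledgeBaseId=kb_id,
            dataSourceId=data_source_id,
            ingestionJobId=job_id,
        )["ingestionJob"]
        if job["status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Ingestion job {job_id} still {job['status']}")
        time.sleep(poll_seconds)


class FakeBedrockAgent:
    """Offline stand-in that replays a fixed sequence of statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def get_ingestion_job(self, knowledgeBaseId, dataSourceId, ingestionJobId):
        return {"ingestionJob": {"ingestionJobId": ingestionJobId,
                                 "status": next(self._statuses)}}


fake = FakeBedrockAgent(["STARTING", "IN_PROGRESS", "COMPLETE"])
job = wait_for_ingestion(fake, "KB_ID", "DS_ID", "JOB_ID", poll_seconds=0)
print(job["status"])
```

In the notebook, the call would take `bedrock_agent`, `BEDROCK_KB_ID`, `DATA_SOURCE_ID`, and `job_id` in place of the fake client and placeholder IDs.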


### Part 2 - Step 8: Create Medical Document Agent with Memory

In [18]:
!pip install --quiet bedrock-agentcore strands-agents

In [4]:
from datetime import datetime
import strands
from bedrock_agentcore.memory import MemoryClient
from bedrock_agentcore.memory.integrations.strands.config import AgentCoreMemoryConfig
from bedrock_agentcore.memory.integrations.strands.session_manager import AgentCoreMemorySessionManager

# Define search_knowledge_base tool with proper error handling
@strands.tool
def search_knowledge_base(query: str) -> str:
    """Search the Bedrock knowledge base for relevant medical documents."""
    try:
        # Ensure we have the required environment variables
        kb_id = os.getenv("BEDROCK_KB_ID")
        if not kb_id:
            return "Error: Knowledge base ID not configured. Please set BEDROCK_KB_ID environment variable."
        
        # Create runtime client if needed
        bedrock_agent_runtime = session.client("bedrock-agent-runtime")
        
        response = bedrock_agent_runtime.retrieve(
            knowledgeBaseId=kb_id,
            retrievalQuery={"text": query},
            retrievalConfiguration={
                "vectorSearchConfiguration": {
                    "numberOfResults": 5,
                    "overrideSearchType": "HYBRID"  # Use hybrid search for better results
                }
            }
        )
        
        results = []
        for result in response.get("retrievalResults", []):
            content = result.get("content", {}).get("text", "")
            score = result.get("score", 0)
            location = result.get("location", {})
            
            # Get source file
            s3_location = location.get("s3Location", {})
            source_uri = s3_location.get("uri", "")
            source_file = source_uri.split("/")[-1] if source_uri else "Unknown source"
            
            # Format result
            results.append(f"**Source:** {source_file} (Relevance: {score:.2f})\n{content[:500]}...")
        
        if results:
            return "\n\n---\n\n".join(results)
        else:
            return f"No documents found for query: '{query}'. The knowledge base may be empty or still processing."
            
    except Exception as e:
        error_msg = str(e)
        if "ResourceNotFoundException" in error_msg:
            return f"Error: Knowledge base {kb_id} not found. Please verify the BEDROCK_KB_ID is correct."
        elif "ValidationException" in error_msg:
            return f"Error: Invalid query or configuration. Details: {error_msg}"
        else:
            return f"Error searching knowledge base: {error_msg}"

In [5]:
# Test the search function before creating agent
print("🔍 Testing knowledge base search function...")
test_result = search_knowledge_base("common cold symptoms")
print(f"Test result: {test_result[:200]}...")

# Every failure message returned by the tool starts with "Error"
if test_result.startswith("Error"):
    print("\n⚠️ Knowledge base search is not working. Check the configuration:")
    print(f"Current KB ID: {os.getenv('BEDROCK_KB_ID')}")
    print(f"Current Region: {os.getenv('AWS_REGION')}")
else:
    print("\n✅ Knowledge base search is working!")

🔍 Testing knowledge base search function...
Test result: **Source:** Understanding_the_symptoms_of_the_common_cold_and_influenza.md (Relevance: 0.60)
Generally the severity of symptoms increases rapidly, peaking 2–3 days after infection, with a mean duratio...

✅ Knowledge base search is working!
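
To make the response shape that the tool consumes easier to see, the sketch below isolates the formatting step and runs it on a hand-written sample. The sample dictionary is illustrative, not real Bedrock output, and `format_retrieval_results` is a hypothetical helper. Unlike the tool above, it appends `...` only when a passage is actually truncated.

```python
def format_retrieval_results(response, max_chars=500):
    """Turn a retrieve-style response dict into a readable text block."""
    blocks = []
    for item in response.get("retrievalResults", []):
        text = item.get("content", {}).get("text", "")
        score = item.get("score", 0.0)
        uri = item.get("location", {}).get("s3Location", {}).get("uri", "")
        # The file name is the last path segment of the S3 URI.
        source = uri.rsplit("/", 1)[-1] if uri else "Unknown source"
        snippet = text if len(text) <= max_chars else text[:max_chars] + "..."
        blocks.append(f"**Source:** {source} (Relevance: {score:.2f})\n{snippet}")
    return "\n\n---\n\n".join(blocks)


# Hand-written sample with the same nesting the tool reads.
sample_response = {
    "retrievalResults": [
        {
            "content": {"text": "Placeholder passage text."},
            "score": 0.6,
            "location": {"s3Location": {"uri": "s3://example-bucket/output/medical/example.md"}},
        }
    ]
}
print(format_retrieval_results(sample_response))
```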


### Part 2 - Step 9: Create Memory Client

In [6]:
# Initialize the memory client
memory_client = MemoryClient(region_name=os.getenv("AWS_REGION", "us-west-2"))

# Try to list existing memories first
try:
    existing_memories = memory_client.gmcp_client.list_memories()
    memory_list = existing_memories.get('memories', [])
    
    # Check if a medical memory already exists
    existing_medical_memory = None
    for mem in memory_list:
        if 'MedicalAgentMemory' in mem.get('name', ''):
            existing_medical_memory = mem
            print(f"📚 Found existing memory: {mem['name']}")
            break
    
    if existing_medical_memory:
        # Use existing memory
        MEMORY_ID = existing_medical_memory.get('id')
        print(f"✅ Reusing existing memory: {MEMORY_ID}")
    else:
        raise Exception("No existing memory found, will create new one")
        
except Exception as e:
    # Create new memory with unique timestamp
    print("📝 Creating new memory...")
    try:
        # Add seconds to make name unique
        comprehensive_memory = memory_client.create_memory_and_wait(
            name=f"MedicalAgentMemory_{datetime.now().strftime('%Y%m%d_%H%M%S')}", 
            description="Memory for medical document analysis with user preferences",
            strategies=[
                {
                    "summaryMemoryStrategy": {
                        "name": "SessionSummarizer",
                        "namespaces": ["/summaries/{actorId}/{sessionId}"]
                    }
                },
                {
                    "userPreferenceMemoryStrategy": {
                        "name": "PreferenceLearner",
                        "namespaces": ["/preferences/{actorId}"]
                    }
                },
                {
                    "semanticMemoryStrategy": {
                        "name": "FactExtractor",
                        "namespaces": ["/facts/{actorId}"]
                    }
                }
            ]
        )
        MEMORY_ID = comprehensive_memory.get('id')
        print(f"✅ New memory created: {MEMORY_ID}")
    except Exception as create_error:
        print(f"⚠️ Could not create memory: {create_error}")
        print("Continuing without memory functionality...")
        MEMORY_ID = None

📝 Creating new memory...
✅ New memory created: MedicalAgentMemory_20251122_012556-5nIp5u3Cls


In [9]:
# Set up memory configuration if memory exists
if MEMORY_ID:
    ACTOR_ID = f"medical_user_{datetime.now().strftime('%H%M%S')}"
    SESSION_ID = f"session_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
    
    print(f"   Actor: {ACTOR_ID}")
    print(f"   Session: {SESSION_ID}")
    
    # Configure memory
    memory_config = AgentCoreMemoryConfig(
        memory_id=MEMORY_ID,
        session_id=SESSION_ID,
        actor_id=ACTOR_ID
    )
    
    # Create session manager
    session_manager = AgentCoreMemorySessionManager(
        agentcore_memory_config=memory_config,
        region_name=os.getenv("AWS_REGION", "us-west-2")
    )
else:
    session_manager = None
    print("⚠️ Agent will run without memory")

   Actor: medical_user_012853
   Session: session_20251122_012853
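
The three memory strategies were created with namespace templates containing `{actorId}` and `{sessionId}`. AgentCore performs the substitution itself. The snippet below only illustrates the paths that result, assuming plain placeholder replacement; `resolve_namespaces` is a hypothetical helper.

```python
# Templates copied from the strategies passed to create_memory_and_wait.
NAMESPACE_TEMPLATES = {
    "SessionSummarizer": "/summaries/{actorId}/{sessionId}",
    "PreferenceLearner": "/preferences/{actorId}",
    "FactExtractor": "/facts/{actorId}",
}


def resolve_namespaces(actor_id, session_id):
    """Fill the placeholders for one actor and session (illustration only)."""
    return {
        name: template.format(actorId=actor_id, sessionId=session_id)
        for name, template in NAMESPACE_TEMPLATES.items()
    }


resolved = resolve_namespaces("medical_user_012853", "session_20251122_012853")
for strategy, namespace in resolved.items():
    print(f"{strategy}: {namespace}")
```

Only the summary namespace is scoped to the session. Preferences and facts are keyed by actor alone, so a new `ACTOR_ID` on each run (as generated above) starts with empty preferences and facts.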


### Part 2 - Step 10: Initialize the Strands Agent

In [10]:
from strands import Agent

# Create the agent with memory and tools
medical_agent = Agent(
    model=os.getenv("BEDROCK_MODEL_ID"),
    name="Medical Document Analyzer with Memory",
    description="Expert agent for medical documents with conversation memory",
    system_prompt="""
        You are a medical document analysis assistant with memory capabilities.
        You remember our conversations, user preferences, and important facts.

        Your capabilities:
        - Search and analyze medical documents from the knowledge base
        - Remember user preferences and conversation history
        - Provide personalized, contextual responses
        - Learn from interactions to improve future responses

        When responding:
        - Reference previous conversations when relevant
        - Remember stated preferences (detail level, focus areas, etc.)
        - Maintain context across multi-turn conversations
        - Acknowledge when recalling information from memory

        You have access to medical documents about common cold, treatments, and symptoms.
        Always provide evidence-based insights from the documents.
        """,
    session_manager=session_manager,
    tools=[search_knowledge_base]
)

print(f"\n✅ Medical agent ready with memory!")
print(f"   Model: {os.getenv('BEDROCK_MODEL_ID')}")
print(f"   Tools: {medical_agent.tool_names}")
print("\n💡 The agent will now remember your preferences and conversation history")


✅ Medical agent ready with memory!
   Model: us.anthropic.claude-sonnet-4-5-20250929-v1:0
   Tools: ['search_knowledge_base']

💡 The agent will now remember your preferences and conversation history


### Part 2 - Step 11: Test It and Make It Interactive!

In [11]:
print("="*70)
print("Medical Agent - Interactive Chat")
print("="*70)
print("\nAsk questions about medicine.")
print("Type 'exit', 'quit', or 'bye' to end the conversation.")
print("="*70 + "\n")

while True:
    try:
        user_input = input("\n🧑 You: ").strip()

        if not user_input:
            continue

        if user_input.lower() in ['exit', 'quit', 'bye', 'q']:
            print("\n👋 Ending conversation. Goodbye!")
            break

        print("\n🤖 Agent: ", end="")
        result = medical_agent(user_input)
        print(result)

    except KeyboardInterrupt:
        print("\n\n👋 Conversation interrupted. Goodbye!")
        break
    except Exception as e:
        print(f"\n❌ Error: {e}")
        print("Please try again or type 'exit' to quit.")

Medical Agent - Interactive Chat

Ask questions about medicine.
Type 'exit', 'quit', or 'bye' to end the conversation.




🧑 You:  Hello! I'm researching common cold treatments.



🤖 Agent: Hello! I'm happy to help you research common cold treatments. Let me search the knowledge base for relevant information on this topic.
Tool #1: search_knowledge_base
Great! I found some useful information about common cold treatments. Based on the medical documents in the knowledge base, here's what the evidence shows:

## Key Treatment Information:

**Traditional Pharmacologic Treatments:**
- **Analgesics/Anti-inflammatory drugs** - These are among the studied interventions
- **Antihistamines (monotherapy)** - The evidence shows they have **no meaningful effect** in treating the common cold. A meta-analysis found no significant improvement in general symptoms
- **Decongestants** - Both short-term and long-term relief options have been studied
- **Combination treatments** - Decongestants plus antihistamines have been researched
- **Antibiotics** - Not recommended, as the common cold is almost exclusively viral (bacterial infections are found in only about 5% of cases)

**N


🧑 You:      response = agent("What are the most effective treatments mentioned in the documents?")



🤖 Agent: Let me search for more specific information about the effectiveness of common cold treatments to give you a comprehensive answer.
Tool #2: search_knowledge_base

Tool #3: search_knowledge_base

Tool #4: search_knowledge_base
Based on the medical documents, here are **the most effective treatments** for the common cold:

## **MOST EFFECTIVE (Supported by Best Evidence):**

### **1. Oral Zinc Supplements (Adults)**
- **Evidence:** Meta-analysis showed significant reduction in colds (RR 0.64, 95% CI 0.47 to 0.88)
- **Benefit:** About 0.5-1.4 fewer colds over 5-7 months
- **Rating:** Likely beneficial

### **2. Honey at Bedtime (Children over 1 year)**
- **Indication:** Specifically for cough
- **Rating:** Best evidence among non-traditional treatments

### **3. Ibuprofen**
- **Benefit:** Superior to acetaminophen for fever treatment in children
- **Category:** Anti-inflammatory/analgesic

### **4. Antihistamine + Decongestant Combination**
- **Evidence:** Meta-analysis showed


🧑 You:  I prefer natural remedies over pharmaceutical ones



🤖 Agent: Thank you for sharing that preference! I'll remember that you prefer natural remedies. Let me search for more detailed information about natural and non-pharmaceutical treatments for the common cold.
Tool #5: search_knowledge_base

Tool #6: search_knowledge_base

Tool #7: search_knowledge_base
Perfect! Since you prefer natural remedies, let me give you a comprehensive overview of the evidence for natural and non-pharmaceutical treatments from the documents:

## **NATURAL REMEDIES WITH BEST EVIDENCE:**

### **1. ✅ HONEY (Highly Recommended)**
- **For:** Cough in children over 1 year old
- **When:** At bedtime
- **Evidence:** Best evidence among non-traditional treatments
- **Safety:** Natural and well-tolerated

### **2. ✅ ZINC SUPPLEMENTS (Recommended)**
- **Form:** Oral zinc (zinc sulfate 10-15 mg)
- **Effectiveness:** 
  - Decreased odds of catching a cold by 58% (OR 0.42)
  - Reduced cold duration by 1.4 days
  - Number needed to treat: 6 people
- **Note:** Though a


🧑 You:  What treatment options would you recommend for me?



🤖 Agent: Based on our conversation and remembering that you prefer natural remedies, here are my personalized treatment recommendations for you:

## **MY TOP NATURAL RECOMMENDATIONS FOR YOU:**

### **For Prevention & Reducing Cold Frequency:**
**1. Zinc Supplements (Daily)**
- **Dosage:** 10-15 mg zinc sulfate daily
- **Evidence:** Strong - reduces cold incidence by 58% and can prevent 0.5-1.4 colds over 5-7 months
- **Best for:** Adults looking to reduce how often they get colds
- **This aligns perfectly with your natural remedy preference!**

### **For Active Cold Treatment:**

**2. Honey (For Cough)**
- **When:** At bedtime
- **Best for:** If you develop a cough (children over 1 year)
- **Evidence:** Best-supported natural treatment for cough symptoms
- **Bonus:** Completely natural, no side effects

**3. Zinc (During Cold)**
- Can also help reduce cold duration by about 1.4 days when taken during illness

### **Worth Trying (More Experimental):**

**4. Garlic Supplements**
- **


🧑 You:  quit



👋 Ending conversation. Goodbye!


---

## Part 3: Invoice Extraction Pipeline
This section sets up the batch invoice extraction Lambda that processes markdown files from `output/invoices/` and extracts structured data.

### Part 3 - Step 1: Create the Extraction Lambda Deployment Package

In [23]:
# Create deployment package for extraction Lambda
source_files = ["batch_invoice_extractor.py"]
requirements = ["landingai-ade","boto3"]

zip_path = create_deployment_package(
    source_files=source_files,
    requirements=requirements,
    output_zip="extraction_lambda.zip",
    package_dir="extraction_package"
)

📦 Creating deployment package: extraction_lambda.zip
   Installing dependencies: landingai-ade, boto3
   Adding source: batch_invoice_extractor.py
   Creating zip archive...
✅ Package created: extraction_lambda.zip (19.8 MB)


In [24]:
# Create or reuse IAM role
role_arn = create_or_update_lambda_role(
    iam_client=iam,
    role_name="lambda-invoice-extractor",
    description="Role for batch invoice extraction Lambda"
)

ℹ️ Using existing role: lambda-invoice-extractor


### Part 3 - Step 2: Deploy the Extraction Lambda Function

In [25]:
# Deploy the extraction Lambda
env_vars = {
    "VISION_AGENT_API_KEY": os.getenv("VISION_AGENT_API_KEY"),
    "S3_BUCKET": os.getenv("S3_BUCKET"),
    "INVOICE_MARKDOWN_PATH": "output/invoices/",
    "EXTRACTED_FOLDER": "extracted/"
}

response = deploy_lambda_function(
    lambda_client=lambda_client,
    function_name="batch-invoice-extractor",
    zip_file="extraction_lambda.zip",
    role_arn=role_arn,
    handler="batch_invoice_extractor.lambda_handler",
    env_vars=env_vars,
    runtime="python3.10",
    timeout=900,
    memory_size=3008
)

print(f"\n📋 Lambda Function Details:")
print(f"   Name: {response.get('FunctionName')}")
print(f"   Runtime: {response.get('Runtime')}")
print(f"   Memory: {response.get('MemorySize')} MB")
print(f"   Timeout: {response.get('Timeout')} seconds")

🚀 Deploying Lambda function: batch-invoice-extractor
ℹ️ Function exists, updating...
   Code updated, waiting for deployment...
✅ Lambda function updated: batch-invoice-extractor

📋 Lambda Function Details:
   Name: batch-invoice-extractor
   Runtime: python3.10
   Memory: 3008 MB
   Timeout: 900 seconds


### Part 3 - Step 3: Upload the Folder

In [38]:
# Upload test invoices to S3 input folder
local_folder = "invoices/" 

# Check if folder exists and upload
if os.path.exists(local_folder):
    count = upload_folder_to_s3(
        s3_client=s3_client,
        local_folder = local_folder,
        s3_prefix=f"input/{local_folder}",
        bucket=os.getenv("S3_BUCKET"),
        file_extensions=[".pdf", ".PDF"]
    )
    print(f"\n⏳ Waiting for automatic parsing to complete...")
    print("   (The existing Lambda will automatically convert PDFs to markdown)")
else:
    print(f"⚠️ Folder not found: {local_folder}")
    print("   Please create this folder and add invoice PDFs, or update the path")

📤 Uploading invoices/ → s3://universal-docs-877560973657/input/invoices/
   (Skipping files that already exist in S3)
   ⬆️ Uploading: invoice_22.pdf
   ⬆️ Uploading: invoice_4.pdf
   ⬆️ Uploading: invoice_13.pdf
   ⬆️ Uploading: invoice_26.pdf
   ⬆️ Uploading: invoice_8.PDF
   ⬆️ Uploading: invoice_17.pdf
   ⬆️ Uploading: invoice_24.pdf
   ⬆️ Uploading: invoice_15.pdf
   ⬆️ Uploading: invoice_6.pdf
   ⬆️ Uploading: invoice_20.pdf
   ⬆️ Uploading: invoice_2.pdf
   ⬆️ Uploading: invoice_11.pdf
   ⬆️ Uploading: invoice_19.pdf
   ⬆️ Uploading: invoice_23.pdf
   ⬆️ Uploading: invoice_5.pdf
   ⬆️ Uploading: invoice_14.pdf
   ⬆️ Uploading: invoice_10.pdf
   ⬆️ Uploading: invoice_1.pdf
   ⬆️ Uploading: invoice_27.pdf
   ⬆️ Uploading: invoice_18.pdf
   ⬆️ Uploading: invoice_9.pdf
   ⬆️ Uploading: invoice_25.pdf
   ⬆️ Uploading: invoice_16.pdf
   ⬆️ Uploading: invoice_7.pdf
   ⬆️ Uploading: 

### Part 3 - Step 4: Run the Extraction

In [19]:
# Check how many files we're processing
invoice_files = monitor_s3_folder(
    s3_client=s3_client,
    bucket=os.getenv("S3_BUCKET"),
    prefix="output/invoices/"
)

invoice_md_files = [f for f in invoice_files if f.endswith('.md')]
print(f"   Found {len(invoice_md_files)} markdown files to process")
print(f"   ⏱️ Estimated time: {len(invoice_md_files) * 2}-{len(invoice_md_files) * 3} seconds\n")

📁 Monitoring s3://universal-docs-877560973657/output/invoices/
   Found 27 files
   Found 27 markdown files to process
   ⏱️ Estimated time: 54-81 seconds



In [26]:
# Invoke extraction Lambda
import time  # imported explicitly rather than relying on the wildcard helper import

start_time = time.time()

result = invoke_lambda_sync(
    lambda_client=lambda_client,
    function_name="batch-invoice-extractor",
    payload=None,
    show_logs=True  # Set to True to see Lambda logs
)

elapsed = time.time() - start_time
print(f"⏱️ Lambda returned after {elapsed:.1f} seconds\n")

⚡ Invoking Lambda: batch-invoice-extractor
✅ Lambda completed successfully in 22.0 seconds

📋 Lambda Logs:
------------------------------------------------------------
Found: output/invoices/invoice_4.md
Found: output/invoices/invoice_5.md
Found: output/invoices/invoice_6.md
Found: output/invoices/invoice_7.md
Found: output/invoices/invoice_8.md
Found: output/invoices/invoice_9.md
📄 Found 27 markdown files to process
🤖 Starting extraction for 27 invoices...
📊 Extracted Invoice Data:
📈 Summary Statistics:
Total invoices: 27
Total value: $6,134,414.89
Unique customers: 22
Unique suppliers: 27
✅ Saved consolidated CSV: extracted/batch_all_invoices_20251122_015318.csv
📊 Contains all 27 invoices in one file
END RequestId: e362f255-ae03-4705-8c25-09ccbafa90c4
REPORT RequestId: e362f255-ae03-4705-8c25-09ccbafa90c4	Duration: 21008.08 ms	Billed Duration: 21705 ms	Memory Size: 3008 MB	Max Memory Used: 127 MB	Init Duration: 696.10 ms	
--------------------------------------
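
The `REPORT` line at the end of the logs is the quickest way to see how the function used its time and memory. The sketch below parses it, assuming the fields are tab-separated `Name: value` pairs as in the log above; `parse_report_line` and `leading_number` are hypothetical helpers.

```python
# The REPORT line from the run above, with tabs written out explicitly.
REPORT_LINE = (
    "REPORT RequestId: e362f255-ae03-4705-8c25-09ccbafa90c4\t"
    "Duration: 21008.08 ms\tBilled Duration: 21705 ms\t"
    "Memory Size: 3008 MB\tMax Memory Used: 127 MB\tInit Duration: 696.10 ms\t"
)


def parse_report_line(line):
    """Split a Lambda REPORT line into a {field name: raw value} dict."""
    fields = {}
    for part in line.strip().split("\t"):
        key, sep, value = part.partition(": ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def leading_number(value):
    """Extract the numeric part of values such as '21008.08 ms' or '127 MB'."""
    return float(value.split()[0])


fields = parse_report_line(REPORT_LINE)
duration_s = leading_number(fields["Duration"]) / 1000
memory_ratio = leading_number(fields["Max Memory Used"]) / leading_number(fields["Memory Size"])

print(f"Duration: {duration_s:.1f} s")
print(f"Memory used: {memory_ratio:.1%} of configured")
```

For this run the function used roughly 4% of its configured 3008 MB. Lambda allocates CPU in proportion to memory, so a lower `memory_size` may also lengthen the run; treat any change as something to measure rather than assume.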

### Part 3 - Step 5: Save & Export the Extracted Output as a CSV 

In [29]:
# Download and display the combined CSV results from S3 using IPython display
import pandas as pd
from IPython.display import display

bucket = os.getenv("S3_BUCKET")

# List files in extracted folder
response = s3_client.list_objects_v2(
   Bucket=bucket,
   Prefix="extracted/batch_all_invoices_"
)

if "Contents" in response:
   # Get the most recent combined CSV
   csv_files = [obj for obj in response["Contents"] if obj["Key"].endswith(".csv")]

   if csv_files:
       # Sort by last modified to get most recent
       latest_csv = sorted(csv_files, key=lambda x: x["LastModified"], reverse=True)[0]
       csv_key = latest_csv["Key"]

       # Download to local file
       local_csv = "extracted_invoices.csv"
       s3_client.download_file(bucket, csv_key, local_csv)

       print(f"📥 Downloaded: {csv_key} → {local_csv}\n")

       # Load with pandas
       df = pd.read_csv(local_csv)

       print("📊 Extracted Data as DataFrame:")
       print("="*80)

       # Set display options for better visibility
       pd.set_option('display.max_columns', None)
       pd.set_option('display.width', None)
       pd.set_option('display.max_colwidth', 50)
       pd.set_option('display.max_rows', 30)

       # Use IPython display for nice formatting
       display(df)

📥 Downloaded: extracted/batch_all_invoices_20251122_015318.csv → extracted_invoices.csv

📊 Extracted Data as DataFrame:


Unnamed: 0,source_file,invoice_number,invoice_date,customer,supplier,subtotal,tax,total,currency,line_items_count,status
0,invoice_1.md,INV33543191,2020-07-29,Abaxys Tech LLC,Zoom Video Communications Inc.,149.9,0.0,149.9,USD,1,PAID
1,invoice_10.md,1000110140,2025-05-15,ANDREA KROPP,Sheraton Tucson Hotel & Suites,909.0,121.53,1030.53,USD,9,PAID
2,invoice_11.md,2071221,2021-08-30,Souhail Martesse,DollarFulfillment,1800.87,,1800.87,USD,1,
3,invoice_12.md,11828454,2025-03-27,MRS ANDREA KROPP,Condor Flugdienst GmbH,2202.46,377.5,2579.96,USD,6,PAID
4,invoice_13.md,812,2021-12-02,SMARTQUIP OVERSEAS COMPANY,KANDHAN METAL COMPANY,5102920.0,918525.6,6021446.0,INR,1,
5,invoice_14.md,40458946,2019-02-23,"Gnr-Grupo Novo Rock, Lda",Thomann GmbH,77.24,0.0,77.24,EUR,5,
6,invoice_15.md,0000329003,2019-04-04,Nazish,Jade E-Services Pakistan Private Limited,147.0,,147.0,PKR,1,
7,invoice_16.md,1,2023-03-20,Mansoer Walizada,Walmart,1529.94,110.92,1640.86,USD,1,PAID
8,invoice_17.md,2014/00355,2014-12-10,Sandip Patil,Variant Technologies,2800.0,,2800.0,INR,1,
9,invoice_18.md,00000116271,2020-02-10,"Meridian Venture Services, LLC","HOWARD CUSTOM TRANSFERS, INC.",270.0,,247.31,USD,2,PAID
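
The rows above mix USD, INR, EUR, and PKR, while the Lambda log reports a single "Total value" in dollars, which appears to add the amounts without converting them. Below is a hedged sketch of two sanity checks: totals per currency, and a comparison of `subtotal + tax` against `total`. It uses the standard `csv` module and four rows copied from the output so that it runs without the S3 download. Treating a missing tax as 0 and tolerating a difference of one currency unit are both assumptions.

```python
import csv
import io
from collections import defaultdict

# Four rows copied from the DataFrame output above (subset of columns).
SAMPLE_CSV = """source_file,subtotal,tax,total,currency
invoice_1.md,149.9,0.0,149.9,USD
invoice_13.md,5102920.0,918525.6,6021446.0,INR
invoice_14.md,77.24,0.0,77.24,EUR
invoice_18.md,270.0,,247.31,USD
"""


def to_float(value):
    """Parse a CSV cell, treating an empty cell as 0.0 (assumption)."""
    return float(value) if value else 0.0


totals = defaultdict(float)
mismatches = []
for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    total = to_float(row["total"])
    totals[row["currency"]] += total
    expected = to_float(row["subtotal"]) + to_float(row["tax"])
    # Allow one currency unit of slack for totals rounded on the invoice.
    if abs(expected - total) > 1.0:
        mismatches.append(row["source_file"])

for currency, amount in sorted(totals.items()):
    print(f"{currency}: {amount:,.2f}")
print("Rows to review:", mismatches)
```

On this sample only `invoice_18.md` is flagged, because its total (247.31) is lower than its subtotal (270.0). Whether that reflects a discount on the invoice or an extraction error has to be checked against the source PDF.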


# The End - Thank you!