aws-invoice-processor

A serverless invoice processing pipeline that extracts data from PDF invoices using AWS Textract, stores it in DynamoDB, and generates AI-powered customer insights and spending pattern analysis using Amazon Bedrock.

Architecture Overview

[Architecture diagram]

Prerequisites

  • An AWS account with admin access
  • AWS SAM CLI installed
  • Docker

Deployment Steps

$ sam build --use-container
$ sam deploy \
  --stack-name aws-invoice-processor \
  --s3-bucket aws-sam-cli-managed-default-<xxx> \
  --capabilities CAPABILITY_IAM \
  --region eu-west-1

What it does

The solution can be used by a company to process invoices it sends to its customers. When someone uploads a PDF invoice to an S3 bucket—for example, for Customer A—the solution extracts the data using AWS Textract and stores it in DynamoDB.

If it’s the first invoice for that customer, no insights will be generated, as there is no historical data yet. For the second and subsequent invoices, the solution generates insights based on the customer's historical data—up to the last five invoices. It will always compare the newest invoice with the four previous ones. These insights are created using Amazon Bedrock and stored in S3.
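To make the comparison concrete, here is a minimal sketch of how the lookup and Bedrock call could look. The table name, key schema, and model ID are illustrative assumptions, not values taken from this repository.

```python
# Illustrative sketch only: table name, key schema, and model ID are
# assumptions, not taken from this repository.
import json

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
bedrock = boto3.client("bedrock-runtime")


def generate_insights(customer_id: str) -> str:
    table = dynamodb.Table("InvoiceTable")  # assumed table name

    # Fetch the five most recent invoices for this customer, newest first
    # (assumed single-table design with PK/SK keys).
    result = table.query(
        KeyConditionExpression=Key("PK").eq(f"CUSTOMER#{customer_id}")
        & Key("SK").begins_with("INVOICE#"),
        ScanIndexForward=False,  # newest first
        Limit=5,
    )
    invoices = result["Items"]

    # Ask Bedrock to compare the newest invoice with the previous ones.
    prompt = (
        "Compare the newest invoice with the previous ones and describe "
        "spending patterns:\n" + json.dumps(invoices, default=str)
    )
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # assumed model
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```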

From a more technical perspective:

  1. Upload Invoice:
    • The user uploads one or more PDF invoices from one or more customers to the S3 bucket specified in the InvoiceBucket parameter. Old PDFs expire after 90 days.
  2. Extract Data:
    • The InvoiceTextractFunction is triggered by the S3 upload event.
    • It uses AWS Textract (async) to extract data from the PDF invoice (a minimal sketch of this call follows this list).
    • Textract will read tables and forms in the PDF.
    • Textract will publish the result to an SNS topic.
    • An SQS queue is subscribed to the SNS topic to reliably buffer and deliver messages to the Lambda function.
  3. Store Data:
    • The InvoiceDataWriter polls messages from the SQS queue.
    • It processes the Textract result and stores the invoice data in DynamoDB. It supports multi-page invoices. It fails when required data is missing from the invoice, and it logs a warning (without failing) when a duplicate invoice is uploaded. An invoice must be unique and cannot be updated.
    • We use the Textract response parser to extract the correct data.
  4. Generate Insights:
    • The InsightGeneratorFunction is triggered by the DynamoDB stream (when a new invoice is added to DynamoDB). It is only triggered when invoice data is added, not when other data is inserted.
    • It gets the last five invoices for a customer from DynamoDB.
    • It checks if the customer has at least 2 invoices for analysis.
    • It tracks processed invoices to avoid duplicate analysis.
      • For new customers: Creates tracking record with initial invoice numbers
      • For existing customers: Compares new invoice numbers with already processed ones. If a new recent invoice is found, it will update the tracking record and generate insights.
    • It uses Amazon Bedrock to generate insights for the most recent invoice compared with the four previous invoices and writes them to S3.
    • Insights are generated for the most recent invoice of each customer.
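As referenced in step 2, a minimal sketch of the asynchronous Textract call could look as follows. The SNS topic and role ARNs are placeholders, not this repository's actual configuration.

```python
# Minimal sketch of step 2; the SNS topic and role ARNs are placeholders.
import boto3

textract = boto3.client("textract")


def start_analysis(bucket: str, key: str) -> str:
    response = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES", "FORMS"],  # read tables and forms
        NotificationChannel={
            "SNSTopicArn": "arn:aws:sns:eu-west-1:123456789012:textract-topic",
            "RoleArn": "arn:aws:iam::123456789012:role/TextractSNSRole",
        },
    )
    # The completion message lands on the SNS topic and is buffered by SQS.
    return response["JobId"]
```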

Parallel Processing

The solution is designed to process multiple invoices in parallel. You can upload multiple PDF files for multiple customers at the same time to the InvoiceBucket. The InvoiceTextractFunction handles each event asynchronously by calling the AWS Textract service to extract data from each invoice independently. Textract publishes the results to an SNS topic to which an SQS queue is subscribed.

The InvoiceDataWriter then processes each SQS message. This can happen in batches, so one Lambda execution may process multiple messages at once. For this we use the batch processing utility from Powertools for AWS Lambda (Python): when one message in a batch fails, only that message is retried instead of the whole batch, which avoids unnecessary retries.
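A minimal sketch of this pattern, using the documented Powertools batch processing API (the record handler body is left open):

```python
# Sketch of partial-batch handling with Powertools for AWS Lambda (Python).
# The SQS event source mapping must have ReportBatchItemFailures enabled.
from aws_lambda_powertools.utilities.batch import (
    BatchProcessor,
    EventType,
    process_partial_response,
)
from aws_lambda_powertools.utilities.data_classes.sqs_event import SQSRecord

processor = BatchProcessor(event_type=EventType.SQS)


def record_handler(record: SQSRecord) -> None:
    # Parse the Textract result and write it to DynamoDB;
    # raising an exception here marks only this record as failed.
    ...


def lambda_handler(event, context):
    # Successful records are deleted from the queue; only the failed
    # ones are reported back to SQS and retried.
    return process_partial_response(
        event=event,
        record_handler=record_handler,
        processor=processor,
        context=context,
    )
```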

Similarly, the InsightGeneratorFunction is triggered by the DynamoDB stream, which can also deliver multiple records in batches, similar to the InvoiceDataWriter.
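A similar sketch for the stream-triggered function. The INSERT check and the sort-key prefix are assumptions about the item layout, used here for illustration:

```python
# Similar sketch for the stream-triggered function. The INSERT check and
# the SK prefix are assumptions about the item layout, for illustration.
from aws_lambda_powertools.utilities.batch import (
    BatchProcessor,
    EventType,
    process_partial_response,
)
from aws_lambda_powertools.utilities.data_classes.dynamo_db_stream_event import (
    DynamoDBRecord,
    DynamoDBRecordEventName,
)

processor = BatchProcessor(event_type=EventType.DynamoDBStreams)


def record_handler(record: DynamoDBRecord) -> None:
    # Only react to newly inserted invoice items, not tracking records.
    if record.event_name != DynamoDBRecordEventName.INSERT:
        return
    sort_key = str(record.dynamodb.keys["SK"])  # deserialized by Powertools
    if not sort_key.startswith("INVOICE#"):  # assumed sort-key prefix
        return
    ...  # fetch the invoice history and generate insights


def lambda_handler(event, context):
    return process_partial_response(
        event=event,
        record_handler=record_handler,
        processor=processor,
        context=context,
    )
```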

Edge Cases explained

Some edge cases can occur when uploading multiple invoices at the same time for the same customer.

To handle those edge cases, we store a TRACKING# item per customer in DynamoDB. This item tracks the invoices that were last taken into account for generating insights. The InsightGeneratorFunction only generates insights if the invoice that triggered it is newer than one of the last five invoices already taken into account.
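A hypothetical sketch of how such a tracking record could be updated atomically with a conditional write; the key names and the processed_invoices attribute are assumptions for illustration:

```python
# Hypothetical sketch: the key names and the processed_invoices attribute
# are assumptions used for illustration only.
import boto3

table = boto3.resource("dynamodb").Table("InvoiceTable")  # assumed name


def mark_processed(customer_id: str, invoice_no: str) -> bool:
    try:
        table.update_item(
            Key={"PK": f"CUSTOMER#{customer_id}", "SK": "TRACKING#"},
            UpdateExpression="ADD processed_invoices :inv",
            # Only succeed when this invoice was not analyzed before.
            ConditionExpression=(
                "attribute_not_exists(processed_invoices) "
                "OR NOT contains(processed_invoices, :no)"
            ),
            ExpressionAttributeValues={
                ":inv": {invoice_no},  # string set with one element
                ":no": invoice_no,
            },
        )
        return True  # new invoice: generate insights
    except table.meta.client.exceptions.ConditionalCheckFailedException:
        return False  # already processed: skip
```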

Let's clarify this with an example. I'm uploading invoices 049 and 050 for the same customer at the same time, and there is already other data for the customer. A few scenarios can occur (the decision logic is sketched after this list):

  • One (or more) of the invoices may still be processed by the InvoiceDataWriter while the InsightGeneratorFunction is already triggered by the DynamoDB stream:
    • If 049 is processed first, the InsightGeneratorFunction generates an insight file for 049 and then an insight file for 050 (because 050 is newer than 049 and becomes the newest invoice for that customer).
    • If 050 is processed first, the InsightGeneratorFunction generates an insight file for 050 and later updates that file when the data for 049 is processed. 050 remains the newest invoice, but 049 affects the insights because it is one of the last five invoices for that customer.
    • If 049 is processed first, it can also happen that the data for 050 is written to DynamoDB while the 049 event is being handled. The 049 event then takes the data of 050 into account and already generates an insight file for 050. The later 050 event does not generate a new insight file but reports "All invoices already processed".
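The decision logic behind these scenarios can be sketched in plain Python (the function and variable names are illustrative only, not this repository's actual code):

```python
# Plain-Python illustration of the decision described above; the names
# are illustrative, not this repository's actual code.
def invoices_to_analyze(last_five: list[str], processed: set[str]) -> list[str]:
    """Return the invoice numbers that still need an insight file."""
    return [inv for inv in last_five if inv not in processed]


# The 049 event already took 050 into account, so the later 050 event
# finds nothing new to analyze:
pending = invoices_to_analyze(
    ["050", "049", "048", "047", "046"],
    {"050", "049", "048", "047", "046"},
)
print(pending or "All invoices already processed")
```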

Test Cases

(Customer data in invoices is fictional and does not represent real companies)

  • Invoice ID - Customer
  • 2025-INV-046 - CarCraft Industries SA
  • 2025-INV-047 - CarCraft Industries SA
  • 2025-INV-048 - CarCraft Industries SA
  • 2025-INV-049 - CarCraft Industries SA
  • 2025-INV-050 - CarCraft Industries SA --> multi-page invoice (Textract should handle this properly)
  • 2025-INV-051 - AutoMotive Excellence NV
  • 2025-INV-052 - CarCraft Industries SA --> missing data (no required value for customer branch)
  • 2025-INV-053 - AutoMotive Excellence NV
  • 2025-INV-054 - CarCraft Industries SA

We successfully tested the solution with the following scenarios:

  • Upload invoice 2025-INV-048:
    • Should write data to DynamoDB but not generate insights (there is no history).
  • Upload invoices at the same time for the same customer (2025-INV-049 and 2025-INV-050):
    • Should write data to DynamoDB for both invoices and generate insights. Whether one or two insight files are generated depends on the processing order (see the edge cases section).
  • Upload an older invoice (2025-INV-047) that will impact the historical data:
    • Should update the insights based on the new historical data (if it's one of the last 5 invoices).
  • Upload an invoice with missing data (2025-INV-052) and an invoice with valid data for the same customer (2025-INV-054), and upload two valid invoices for the same new customer (2025-INV-051 and 2025-INV-053):
    • One invoice should fail and not write data to DynamoDB; only that message should be retried (not the full batch).
    • One invoice for that same customer should write data to DynamoDB and generate insights.
    • One invoice is the first invoice for a new customer, it will be written to DynamoDB but it should not generate insights.
    • The other invoice should write data to DynamoDB and generate insights independently.
  • Upload a duplicate invoice (2025-INV-048):
    • Should give a warning (no error) and not write data to DynamoDB.
  • Upload an invoice that is so old that it will not impact the historical data (2025-INV-046):
    • Should not update the insights (because it is older than the five most recent invoices).

Monitoring and Logging

The solution uses Amazon CloudWatch for monitoring and logging. Logs are retained for 30 days. An alarm is triggered if one of the Lambda functions fails. There is also an SQS dead-letter queue (DLQ) configured for the InvoiceDataWriter so failed messages can be retried later. The DLQ keeps messages for 14 days.
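If needed, messages that landed in the DLQ can be redriven with the SQS message-move API, for example as sketched below (the queue ARN is a placeholder):

```python
# Sketch of a manual DLQ redrive using the SQS message-move API;
# the queue ARN is a placeholder.
import boto3

sqs = boto3.client("sqs")

sqs.start_message_move_task(
    SourceArn="arn:aws:sqs:eu-west-1:123456789012:invoice-dlq",  # placeholder
    # Without DestinationArn, messages return to their original source queue.
    MaxNumberOfMessagesPerSecond=10,
)
```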

Code Quality

This project uses the following tools to maintain code quality:

  • Black - Python code formatting
  • Pylint - Python code analysis and linting
  • cfn-lint - CloudFormation template validation

The setup.cfg file is used to ignore some less relevant pylint warnings for the project.
