A serverless invoice processing pipeline that extracts data from PDF invoices using AWS Textract, stores it in DynamoDB, and generates AI-powered customer insights and spending pattern analysis using Amazon Bedrock.
- An AWS account with admin access
- AWS SAM CLI installed
- Docker
```bash
$ sam build --use-container
$ sam deploy \
    --stack-name aws-invoice-processor \
    --s3-bucket aws-sam-cli-managed-default-<xxx> \
    --capabilities CAPABILITY_IAM \
    --region eu-west-1
```

The solution can be used by a company to process the invoices it sends to its customers. When someone uploads a PDF invoice to an S3 bucket (for example, for Customer A), the solution extracts the data using AWS Textract and stores it in DynamoDB.
If it’s the first invoice for that customer, no insights will be generated, as there is no historical data yet. For the second and subsequent invoices, the solution generates insights based on the customer's historical data—up to the last five invoices. It will always compare the newest invoice with the four previous ones. These insights are created using Amazon Bedrock and stored in S3.
From a more technical perspective:
- Upload Invoice:
  - The user uploads one or more PDF invoices, for one or more customers, to the S3 bucket specified in the `InvoiceBucket` parameter. Old PDFs expire after 90 days.
- Extract Data:
  - The `InvoiceTextractFunction` is triggered by the S3 upload event.
  - It uses AWS Textract (async) to extract data from the PDF invoice.
  - Textract reads the tables and forms in the PDF.
  - Textract publishes the result to an SNS topic.
  - An SQS queue is subscribed to the SNS topic to reliably buffer and deliver messages to the Lambda function.
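As an illustration of this step, here is a minimal sketch of how the S3 upload event could be turned into asynchronous Textract jobs. The topic and role ARNs are placeholders, not the project's actual resources:

```python
def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 Put event; one event can
    carry multiple records when several PDFs are uploaded at once."""
    return [
        (r["s3"]["bucket"]["name"], r["s3"]["object"]["key"])
        for r in event.get("Records", [])
    ]


def handler(event, context):
    import boto3  # imported lazily so the parsing logic stays testable offline

    textract = boto3.client("textract")
    for bucket, key in extract_s3_objects(event):
        # Async analysis: Textract reads TABLES and FORMS and publishes
        # the job result to the SNS topic when it is done.
        textract.start_document_analysis(
            DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
            FeatureTypes=["TABLES", "FORMS"],
            NotificationChannel={
                # Placeholder ARNs for illustration only.
                "SNSTopicArn": "arn:aws:sns:eu-west-1:123456789012:textract-results",
                "RoleArn": "arn:aws:iam::123456789012:role/TextractPublishRole",
            },
        )
```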
- Store Data:
  - The `InvoiceDataWriter` polls messages from the SQS queue.
  - It processes the Textract result and stores the invoice data in DynamoDB. It supports multi-page invoices.
  - It fails when required data is missing from the invoice. It does not fail but emits a warning when a duplicate invoice is uploaded: an invoice must be unique and cannot be updated.
  - We use the Textract response parser to extract the correct data.
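The missing-data and duplicate handling described above can be sketched as follows. The table layout, key schema, and required field names here are assumptions for illustration, not the project's actual schema:

```python
def build_invoice_item(customer, invoice_id, fields):
    """Build the DynamoDB item for an invoice (single-table layout assumed)."""
    # The writer fails hard on missing required data; "invoice_date" and
    # "total" are placeholder field names for illustration.
    missing = [k for k in ("invoice_date", "total") if k not in fields]
    if missing:
        raise ValueError(f"invoice {invoice_id} is missing required fields: {missing}")
    return {"PK": f"CUSTOMER#{customer}", "SK": f"INVOICE#{invoice_id}", **fields}


def store_invoice(table, item):
    """Write the item; warn (not fail) on duplicates via a conditional put."""
    from botocore.exceptions import ClientError  # lazy: keeps the rest testable offline

    try:
        table.put_item(
            Item=item,
            # An invoice is unique and cannot be updated: reject overwrites.
            ConditionExpression="attribute_not_exists(SK)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            print(f"WARNING: duplicate invoice {item['SK']}, skipping")
        else:
            raise
```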
- Generate Insights:
  - The `InsightGeneratorFunction` is triggered by the DynamoDB stream when a new invoice is added to DynamoDB. It is only triggered when invoice data is added, not when other data is inserted.
  - It gets the last five invoices for a customer from DynamoDB.
  - It checks whether the customer has at least two invoices for analysis.
  - It tracks processed invoices to avoid duplicate analysis:
    - For new customers, it creates a tracking record with the initial invoice numbers.
    - For existing customers, it compares the new invoice numbers with the ones already processed. If a new recent invoice is found, it updates the tracking record and generates insights.
  - It uses Amazon Bedrock to generate insights for the most recent invoice compared with the four previous invoices and writes them to S3.
  - Insights are generated for the most recent invoice of each customer.
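A minimal sketch of the Bedrock call, assuming an Anthropic model on Bedrock and a single-prompt comparison. The model id, prompt wording, and invoice shape are illustrative assumptions, not the project's actual ones:

```python
import json


def build_prompt(invoices):
    """invoices: newest-first list of invoice dicts (the newest plus up to
    four previous ones are taken into account)."""
    newest, history = invoices[0], invoices[1:5]
    return (
        "Compare the newest invoice with the customer's previous invoices "
        "and summarise spending patterns.\n"
        f"Newest: {json.dumps(newest)}\n"
        f"History: {json.dumps(history)}"
    )


def generate_insights(invoices):
    import boto3  # lazy import keeps build_prompt testable offline

    bedrock = boto3.client("bedrock-runtime")
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # assumed model id
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": build_prompt(invoices)}],
        }),
    )
    return json.loads(response["body"].read())["content"][0]["text"]
```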
The solution is designed to process multiple invoices in parallel. You can upload multiple PDF files for multiple customers to the `InvoiceBucket` at the same time. The `InvoiceTextractFunction` handles each event asynchronously by calling the AWS Textract service to extract data from each invoice independently. Textract publishes the results to an SNS topic to which an SQS queue is subscribed.
The `InvoiceDataWriter` then processes the SQS messages. This can happen in batches, so one Lambda execution may process multiple messages at once. For this we use batch processing from Powertools for AWS Lambda (Python).
When one message in a batch fails, the whole batch is not failed and retried; only the failed message is retried. This avoids unnecessary retries for the rest of the batch.
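Powertools' `BatchProcessor` implements this partial-failure behavior for us; the underlying SQS contract (only the ids of failed messages are reported back via `batchItemFailures`, so only those are retried) can be sketched by hand as a simplified illustration, not the Powertools implementation:

```python
def process_batch(records, record_handler):
    """Run record_handler over every SQS record and return the partial
    batch response: only failed message ids are listed, so SQS retries
    only those messages instead of the whole batch."""
    failures = []
    for record in records:
        try:
            record_handler(record)
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```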
Similarly, the `InsightGeneratorFunction` is triggered by the DynamoDB stream and can also process multiple records in batches, similar to the `InvoiceDataWriter`.
Some edge cases can occur when uploading multiple invoices at the same time for the same customer.
To solve those edge cases, we store a `TRACKING#` item per customer in DynamoDB. This item is used to track the last processed invoices that were taken into account for generating insights. The `InsightGeneratorFunction` will only generate insights if the invoice that triggered the function is newer than one of the last five invoices taken into account for generating insights.
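The tracking check can be sketched as a small pure function. The inputs are assumptions about how the `TRACKING#` item is shaped: insights are only regenerated when the current last-five window contains an invoice id that is not yet recorded as processed.

```python
def needs_new_insights(last_five_ids, processed_ids):
    """last_five_ids: ids of the five most recent invoices in DynamoDB;
    processed_ids: ids stored in the customer's TRACKING# item.
    Generate insights only if the window contains an unseen invoice."""
    return not set(last_five_ids) <= set(processed_ids)
```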
Let's clarify this with an example. I'm uploading invoices 049 and 050 for the same customer at the same time. There is already other data for the customer. A few scenarios can occur:
- One (or more) of the invoices may still be being processed by the `InvoiceDataWriter` while the `InsightGeneratorFunction` is already triggered by the DynamoDB stream:
  - If `049` is processed first, the `InsightGeneratorFunction` generates an insight file for `049` and next an insight file for `050` (because `050` is newer than `049` and becomes the newest invoice for that customer).
  - If `050` is processed first, the `InsightGeneratorFunction` generates an insight file for `050` and then updates the insight file for `050` when the data for `049` is processed. `050` is the newest invoice, but `049` impacts the insights because it is one of the last five invoices for that customer.
  - If `049` is processed first, the `InsightGeneratorFunction` generates an insight file, but it can happen that the data for `050` is written to DynamoDB during this window. The event for `049` then takes the data of `050` into account and already generates an insight file for `050`. The `050` event then does not generate a new insight file but reports "All invoices already processed".
(Customer data in invoices is fictional and does not represent real companies)
| Invoice ID | Customer | Note |
|---|---|---|
| 2025-INV-046 | CarCraft Industries SA | |
| 2025-INV-047 | CarCraft Industries SA | |
| 2025-INV-048 | CarCraft Industries SA | |
| 2025-INV-049 | CarCraft Industries SA | |
| 2025-INV-050 | CarCraft Industries SA | Multi-page invoice (Textract should handle this properly) |
| 2025-INV-051 | AutoMotive Excellence NV | |
| 2025-INV-052 | CarCraft Industries SA | Missing data (no required value for customer branch) |
| 2025-INV-053 | AutoMotive Excellence NV | |
| 2025-INV-054 | CarCraft Industries SA | |
We successfully tested the solution with the following scenarios:
- Upload invoice `2025-INV-048`:
  - Should write data to DynamoDB but not generate insights (there is no history).
- Upload invoices at the same time for the same customer (`2025-INV-049` and `2025-INV-050`):
  - Should write data to DynamoDB for both invoices and generate insights. Whether one or two insight files are generated depends on the processing order (see the parallel processing section).
- Upload an older invoice (`2025-INV-047`) that will impact the historical data:
  - Should update the insights based on the new historical data (if it is one of the last five invoices).
- Upload an invoice with missing data (`2025-INV-052`) and an invoice with valid data for the same customer (`2025-INV-054`), and upload two valid invoices for the same new customer (`2025-INV-051` and `2025-INV-053`):
  - One invoice should fail and not write data to DynamoDB; only that one should be retried (not the full batch).
  - One invoice for that same customer should write data to DynamoDB and generate insights.
  - One invoice is the first invoice for a new customer; it is written to DynamoDB but should not generate insights.
  - The other invoice should write data to DynamoDB and generate insights independently.
- Upload a duplicate invoice (`2025-INV-048`):
  - Should give a warning (no error) and not write data to DynamoDB.
- Upload an invoice that is so old that it will not impact the historical data (`2025-INV-046`):
  - Should not update the insights (as it is older than the five most recent invoices).
The solution uses Amazon CloudWatch for monitoring and logging. Logs are retained for 30 days. An alarm is triggered if one of the Lambda functions fails. There is also an SQS dead-letter queue (DLQ) configured for the `InvoiceDataWriter` so failed messages can be retried later; the DLQ keeps messages for 14 days.
This project uses the following tools to maintain code quality:
- Black - Python code formatting
- Pylint - Python code analysis and linting
- cfn-lint - CloudFormation template validation
The `setup.cfg` file is used to ignore some less relevant Pylint warnings for the project.
