This workshop demonstrates how to build a Document parser and query engine with Amazon Textract and other services, such as ElasticSearch and DynamoDB.

josephmisiti/amazon-textract-enhancer

 
 

Amazon Textract Enhancer - Overview

This workshop demonstrates how to build a text parser and feature extractor with Amazon Textract. With Amazon Textract you can detect text in a PDF document or a scanned image of a printed document and extract lines of text using the Text Detection API. You can also use the Document Analysis API to extract tables and forms from the scanned document.
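As a minimal sketch of the two synchronous calls described above, the functions below take a Textract client as an argument (for example one created with `boto3.client("textract")`). The function names `detect_lines` and `analyze_tables_and_forms` are illustrative and are not taken from this repository.

```python
def detect_lines(client, bucket, key):
    """Return the text of every LINE block for a single-page document in S3.

    `client` is assumed to be a Textract client, e.g. boto3.client("textract").
    """
    response = client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]


def analyze_tables_and_forms(client, image_bytes):
    """Run Document Analysis on raw image bytes and return all result blocks."""
    response = client.analyze_document(
        Document={"Bytes": image_bytes},
        FeatureTypes=["TABLES", "FORMS"],
    )
    return response["Blocks"]
```

Passing the client in, rather than creating it inside the functions, keeps the sketch easy to exercise with a stub instead of a live AWS account.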

It is straightforward to invoke these APIs from the AWS CLI or the Boto3 Python library, passing either a pointer to the document image stored in S3 or the raw image bytes. However, handling large volumes of documents this way becomes impractical for several reasons:

  • Synchronous calls to the Textract API are not possible for multi-page PDF documents
  • Synchronous calls will exceed the provisioned throughput if used for a large number of documents within a short period of time
  • If the same document must be queried multiple times, each query triggers another Textract API invocation, and cost increases rapidly
  • Textract returns analysis results with rich metadata, but the structures of tables and forms are not immediately apparent without some post-processing
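To illustrate the kind of post-processing the last point refers to, here is a hedged sketch that rebuilds tables from the flat list of blocks in a Document Analysis response. It relies on the `TABLE`, `CELL` and `WORD` block types, their `CHILD` relationships, and the 1-based `RowIndex` and `ColumnIndex` fields. The helper names are hypothetical, and merged cells and selection elements are deliberately ignored.

```python
def _text_of(block, blocks_by_id):
    """Join the WORD children of a block into a single string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def extract_tables(blocks):
    """Rebuild each TABLE block as a list of rows, each a list of cell strings."""
    blocks_by_id = {b["Id"]: b for b in blocks}
    tables = []
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for rel in block.get("Relationships", []):
            if rel["Type"] != "CHILD":
                continue
            for cell_id in rel["Ids"]:
                cell = blocks_by_id[cell_id]
                if cell["BlockType"] == "CELL":
                    position = (cell["RowIndex"], cell["ColumnIndex"])
                    cells[position] = _text_of(cell, blocks_by_id)
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        # Row and column indexes start at 1; missing cells become empty strings.
        tables.append(
            [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
             for r in range(1, n_rows + 1)]
        )
    return tables
```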

The Textract Enhancer solution demonstrated in this workshop uses the following approaches to provide a more robust end-to-end solution:

  • Lambda functions, triggered by document upload to a specific S3 bucket, to submit document analysis and text detection jobs to Textract
  • API Gateway methods to trigger Textract job submission on demand
  • Asynchronous API calls to start document analysis and text detection, with a unique request token to prevent duplicate submissions
  • Use of SNS topics to get notified on completion of Textract jobs
  • Automatically triggered post-processing Lambda functions to extract actual tables, forms and lines of text, stored in S3 for future querying
  • Job status and metadata tracked in a DynamoDB table, allowing for troubleshooting and easy querying of results
  • API Gateway methods to retrieve results anytime without having to use Textract
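The submission path in the list above can be sketched as follows: read the uploaded object from the S3 event, derive a stable request token, and start both asynchronous jobs with an SNS notification channel. This is an assumption-laden illustration, not the repository's code. The function names, the token scheme (a SHA-256 digest of job type, bucket and key) and the `topic_arn` and `role_arn` parameters are hypothetical.

```python
import hashlib
from urllib.parse import unquote_plus


def documents_from_s3_event(event):
    """Yield (bucket, key) for every object in an S3 notification event."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        yield bucket, key


def request_token(bucket, key, job_type):
    """Derive a stable token so resubmitting the same document reuses its job.

    Hypothetical scheme: 64 hex characters. Check the Textract API reference
    for the current limits on token length and allowed characters.
    """
    return hashlib.sha256(f"{job_type}:{bucket}/{key}".encode("utf-8")).hexdigest()


def submit_jobs(client, bucket, key, topic_arn, role_arn):
    """Start both asynchronous jobs and return their job IDs by job type."""
    location = {"S3Object": {"Bucket": bucket, "Name": key}}
    channel = {"SNSTopicArn": topic_arn, "RoleArn": role_arn}
    analysis = client.start_document_analysis(
        DocumentLocation=location,
        FeatureTypes=["TABLES", "FORMS"],
        ClientRequestToken=request_token(bucket, key, "DocumentAnalysis"),
        NotificationChannel=channel,
    )
    detection = client.start_document_text_detection(
        DocumentLocation=location,
        ClientRequestToken=request_token(bucket, key, "TextDetection"),
        NotificationChannel=channel,
    )
    return {
        "DocumentAnalysis": analysis["JobId"],
        "TextDetection": detection["JobId"],
    }
```

Using a separate token per job type keeps the two submissions for one document from colliding, while repeated uploads of the same document map to the same pair of tokens.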

License Summary

This sample code is made available under a modified MIT license. See the LICENSE file.

Prerequisites

In order to complete this workshop you'll need an AWS Account with access to create:

  • AWS IAM Roles
  • S3 Bucket
  • S3 bucket policies
  • SNS topics
  • DynamoDB tables
  • Lambda functions
  • API Gateway endpoints and deployments

1. Launch stack

Each Textract Enhancer solution component can be built by hand, using either the AWS Console or the AWS CLI. AWS CloudFormation, on the other hand, provides a mechanism to script the hard work of launching the whole stack.

You can use the button below to launch the solution stack. Its components are described in the following sections.

Region                   Launch
US East (N. Virginia)    Launch Textract Enhancer in us-east-1

2. Architecture

The solution architecture is based solely on serverless Lambda functions that invoke Textract API endpoints. The architecture uses Textract in asynchronous mode and uses a DynamoDB table to keep track of job status and response location.

[Figure: Job submission architecture]
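When a job finishes, Textract publishes a completion message to the SNS topic, and a Lambda function subscribed to that topic can turn it into a status record. The sketch below assumes the standard SNS-to-Lambda event shape and the Textract completion message fields (`JobId`, `Status`, `API`, `DocumentLocation`). The `JobStatus` attribute name and the function name are hypothetical; `JobId`, `JobType`, `DocumentBucket` and `DocumentPath` follow the tracking table described in this workshop.

```python
import json

# Map the Textract operation named in the message to the JobType used for tracking.
_JOB_TYPES = {
    "StartDocumentAnalysis": "DocumentAnalysis",
    "StartDocumentTextDetection": "TextDetection",
}


def parse_completion(event):
    """Turn an SNS-triggered Lambda event into one status record per message."""
    records = []
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        location = message["DocumentLocation"]
        records.append({
            "JobId": message["JobId"],
            "JobType": _JOB_TYPES.get(message["API"], message["API"]),
            "JobStatus": message["Status"],  # hypothetical attribute name
            "DocumentBucket": location["S3Bucket"],
            "DocumentPath": location["S3ObjectName"],
        })
    return records
```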

The solution also uses a REST API, backed by another set of Lambda functions and the DynamoDB table, to provide fast querying of the resulting documents from the S3 bucket.

[Figure: Result retrieval architecture]
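A retrieval function behind the REST API could look roughly like the sketch below. It assumes an API Gateway Lambda proxy integration, which expects a response with `statusCode`, `headers` and `body`. The query parameter names `Bucket` and `Document`, the function names, and the injected `lookup` callable are hypothetical and stand in for whatever the deployed functions actually use.

```python
import json


def build_response(status_code, payload):
    """Shape a result the way an API Gateway Lambda proxy integration expects."""
    return {
        "statusCode": status_code,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


def handle_request(event, lookup):
    """Validate the query string, then delegate to `lookup(bucket, document)`.

    `lookup` is assumed to read the tracking table and return a list of results.
    """
    params = event.get("queryStringParameters") or {}
    bucket = params.get("Bucket")
    document = params.get("Document")
    if not bucket or not document:
        return build_response(400, {"error": "Bucket and Document are required"})
    results = lookup(bucket, document)
    if not results:
        return build_response(404, {"error": "No results for this document"})
    return build_response(200, {"results": results})
```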

3. Solution components

3.1. DynamoDB Table

  • When a Textract job is submitted in asynchronous mode with a request token, a unique job ID is created. For any subsequent submission of the same document with the same token, Textract does not run the same job over again. Because this solution submits two different types of jobs, one for DocumentAnalysis and one for TextDetection, a DynamoDB table with JobId as HASH key and JobType as RANGE key is used to track the status of each job.
  • To facilitate lookups by document location, the table also uses a global secondary index, with DocumentBucket as HASH key and DocumentPath as RANGE key. The retrieval functions use this information later, when an API request is sent to obtain the tables, forms and lines of text.
  • Upon completion of a job, the post-processing Lambda functions update the corresponding records in this DynamoDB table with the location of the extracted files, as stored in the S3 bucket, and other metadata such as completion time and the number of pages, lines, tables and form fields.
The following snippet shows the schema definition used to define the table:

"AttributeDefinitions": [
    {
        "AttributeName": "JobId",
        "AttributeType": "S"
    },       
    {
        "AttributeName": "JobType",
        "AttributeType": "S"
    },                                
    {
        "AttributeName": "DocumentBucket",
        "AttributeType": "S"
    },
    {
        "AttributeName": "DocumentPath",
        "AttributeType": "S"
    }                    
],
"KeySchema": [
    {
        "AttributeName": "JobId",
        "KeyType": "HASH"
    },
    {
        "AttributeName": "JobType",
        "KeyType": "RANGE"
    }                    
],
"GlobalSecondaryIndexes": [
    {
        "IndexName": "DocumentIndex",
        "KeySchema": [
            {
                "AttributeName": "DocumentBucket",
                "KeyType": "HASH"
            },
            {
                "AttributeName": "DocumentPath",
                "KeyType": "RANGE"
            }
        ],
        "Projection": {
            "ProjectionType": "KEYS_ONLY"
        },
        "ProvisionedThroughput": {
            "ReadCapacityUnits": 5,
            "WriteCapacityUnits": 5
        }
    }
],
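Given the schema above, a lookup by document location might be sketched as follows. Because the index projection is `KEYS_ONLY`, a query on `DocumentIndex` returns only the index keys and the table keys, so this sketch fetches each full record from the base table afterwards. The function name and the `table_name` parameter are hypothetical, `client` is assumed to be a low-level DynamoDB client such as `boto3.client("dynamodb")`, and pagination of query results is left out for brevity.

```python
def find_jobs_for_document(client, table_name, bucket, path):
    """Query DocumentIndex, then fetch each full record from the base table."""
    response = client.query(
        TableName=table_name,
        IndexName="DocumentIndex",
        KeyConditionExpression="DocumentBucket = :b AND DocumentPath = :p",
        ExpressionAttributeValues={":b": {"S": bucket}, ":p": {"S": path}},
    )
    records = []
    for item in response.get("Items", []):
        # KEYS_ONLY projection: index items carry only the index and table keys.
        key = {"JobId": item["JobId"], "JobType": item["JobType"]}
        full = client.get_item(TableName=table_name, Key=key)
        if "Item" in full:
            records.append(full["Item"])
    return records
```

If the retrieval path needed more attributes on every lookup, widening the index projection would avoid the second read, at the cost of extra storage and write capacity.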

3.2. Lambda execution role

  • The Lambda functions used in this solution prototype use a common execution role. The role's trust policy allows the Lambda service to assume the role, to which the required policies are attached.
The following snippet shows the assume role policy document for the Lambda execution role:

"AssumeRolePolicyDocument": {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "lambda.amazonaws.com"
                ]
            },
            "Action": [
                "sts:AssumeRole"
            ]
        }                       
    ]
}

  • The basic execution policy allows the Lambda functions to publish events to CloudWatch Logs and to send trace segments to AWS X-Ray.
The following snippet shows the basic execution role policy document:

{
    "PolicyName": "lambda_basic_execution_policy",
    "PolicyDocument": {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents"
                ],
                "Resource": "arn:aws:logs:*:*:*"
            },
            {
                "Effect": "Allow",
                "Action": [
                    "xray:PutTraceSegments"
                ],
                "Resource": "*"
            }                                
        ]
    }
}

  • The Textract access policy attached to this role allows the Lambda functions to execute Textract API calls.
The following snippet shows the Textract access policy document:

{
    "PolicyName": "textract_access_policy",
    "PolicyDocument": {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "textract:*",
                "Resource": "*"
            }                             
        ]
    }
} 

  • The DynamoDB access policy attached to this role allows the Lambda functions to write records to and read records from the tracking table.
The following snippet shows the DynamoDB access policy document:

{
    "PolicyName": "dynamodb_access_policy",
    "PolicyDocument": {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "dynamodb:*",
                "Resource": "*"
            }                             
        ]
    }
}

  • An IAM access policy is also attached to this role. When the job submission Lambda function is invoked with the name of a bucket owned by another AWS account, it automatically creates an IAM policy and attaches it to its own role, thereby allowing access to documents stored in the provided bucket.
The following snippet shows the IAM access policy document:

{
    "PolicyName": "iam_access_policy",
    "PolicyDocument": {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "iam:*",
                "Resource": "*"
            }                             
        ]
    }
}
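The self-attached bucket access described above could be approximated as in the sketch below. It is an illustration under assumptions, not the repository's implementation: it uses an inline role policy via `put_role_policy` rather than a separately created managed policy, grants only `s3:GetObject`, and the function names and the `textract_read_` policy name prefix are hypothetical. For a bucket in another account, the bucket owner must also allow the access on their side.

```python
import json


def bucket_read_policy(bucket):
    """Build a policy document that allows reading objects from one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


def attach_bucket_access(iam_client, role_name, bucket):
    """Attach the read policy to a role as an inline policy and return its name.

    `iam_client` is assumed to be an IAM client, e.g. boto3.client("iam").
    """
    policy_name = f"textract_read_{bucket}"  # hypothetical naming scheme
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(bucket_read_policy(bucket)),
    )
    return policy_name
```

Scoping the generated policy to a single bucket keeps each grant narrow, even though the role itself holds the broad `iam:*` permission shown above.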

3.3. DynamoDB Table

3.4. SNS Topic

3.5. Textract service role

3.6. Job submission - Lambda function

3.7. Post Processing - Lambda functions

3.8. S3 Bucket

3.9. Rest API
