Amazon Textract Enhancer - Overview

This workshop demonstrates how to build a text parser and feature extractor with Amazon Textract. With Amazon Textract you can detect text in a PDF document or a scanned image of a printed document and extract lines of text using the Text Detection API. In addition, you can use the Document Analysis API to extract tables and forms from the scanned document.
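As a rough illustration of the two APIs named above, the following sketch builds the Document argument from either an S3 pointer or raw image bytes and passes it to the synchronous Boto3 calls detect_document_text and analyze_document. The helper names (build_document, detect_lines, analyze_tables_and_forms) are hypothetical and not part of this repository; the client is passed in so the helpers can be exercised without AWS credentials.

```python
from typing import Optional


def build_document(bucket: Optional[str] = None, key: Optional[str] = None,
                   image_bytes: Optional[bytes] = None) -> dict:
    """Build the Document argument: raw image bytes or a pointer to an S3 object."""
    if image_bytes is not None:
        return {"Bytes": image_bytes}
    if bucket and key:
        return {"S3Object": {"Bucket": bucket, "Name": key}}
    raise ValueError("Provide either image_bytes or both bucket and key")


def detect_lines(client, document: dict) -> list:
    """Call the Text Detection API and return the text of each LINE block."""
    response = client.detect_document_text(Document=document)
    return [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]


def analyze_tables_and_forms(client, document: dict) -> list:
    """Call the Document Analysis API and return the raw blocks for tables and forms."""
    response = client.analyze_document(
        Document=document, FeatureTypes=["TABLES", "FORMS"]
    )
    return response["Blocks"]
```

With real credentials, the client would typically come from boto3.client("textract"), for example detect_lines(client, build_document(bucket="my-bucket", key="scan.png")), where the bucket and key are placeholders.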

It is straightforward to invoke these APIs from the AWS CLI or the Boto3 Python library, passing either a pointer to the document image stored in S3 or the raw image bytes, to obtain results. However, handling large volumes of documents this way becomes impractical for several reasons:

Synchronous Textract API calls are not possible for multi-page PDF documents
Synchronous calls will exceed provisioned throughput if used for a large number of documents within a short period of time
If the same document needs to be queried multiple times, each query triggers another Textract API invocation and cost increases rapidly
Textract returns analysis results with rich metadata, but the structures of tables and forms are not immediately apparent without some post-processing
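To make the last point concrete, here is a minimal sketch of the kind of post-processing a Document Analysis response needs before a table becomes usable. It assumes the usual Textract block layout, in which TABLE blocks list CELL blocks as CHILD relationships, and CELL blocks carry RowIndex and ColumnIndex and list WORD blocks as children. The function names are illustrative only and are not taken from this repository; merged cells and selection elements are ignored.

```python
def child_ids(block):
    """Return the Ids of all CHILD relationships of a block."""
    ids = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            ids.extend(rel["Ids"])
    return ids


def cell_text(cell, index):
    """Join the WORD children of a CELL block into one string."""
    words = [index[cid]["Text"] for cid in child_ids(cell)
             if index[cid]["BlockType"] == "WORD"]
    return " ".join(words)


def table_to_rows(table, index):
    """Rebuild a TABLE block as a list of rows, each a list of cell strings."""
    cells = [index[cid] for cid in child_ids(table)
             if index[cid]["BlockType"] == "CELL"]
    if not cells:
        return []
    n_rows = max(c["RowIndex"] for c in cells)
    n_cols = max(c["ColumnIndex"] for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        # RowIndex and ColumnIndex are 1-based in the response.
        grid[c["RowIndex"] - 1][c["ColumnIndex"] - 1] = cell_text(c, index)
    return grid


def extract_tables(blocks):
    """Return every table found in a list of Textract blocks."""
    index = {b["Id"]: b for b in blocks}
    return [table_to_rows(b, index) for b in blocks if b["BlockType"] == "TABLE"]
```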

The Textract Enhancer solution demonstrated in this workshop uses the following approaches to provide a more robust end-to-end solution:

Lambda functions, triggered by document upload to a specific S3 bucket, to submit document analysis and text detection jobs to Textract
API Gateway methods to trigger Textract job submission on demand
Asynchronous API calls to start document analysis and text detection, with a unique request token to prevent duplicate submissions
SNS topics to get notified on completion of Textract jobs
Automatically triggered post-processing Lambda functions to extract the actual tables, forms and lines of text, stored in S3 for future querying
Job status and metadata tracked in a DynamoDB table, allowing for troubleshooting and easy querying of results
API Gateway methods to retrieve results at any time without having to call Textract again
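The asynchronous submission step in the list above can be sketched roughly as follows. This is not the repository's Lambda code. It assumes the request token is derived by hashing the job type and document location (one possible way to make repeated submissions of the same document reuse the same token), and it uses the Boto3 calls start_document_analysis and start_document_text_detection with their ClientRequestToken and NotificationChannel parameters. The function names and the ARNs passed in are placeholders.

```python
import hashlib


def request_token(bucket, key, job_type):
    """Derive a deterministic token (64 hex characters) from the job type and document."""
    raw = f"{job_type}:{bucket}/{key}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def start_job_kwargs(bucket, key, job_type, sns_topic_arn, role_arn):
    """Build the keyword arguments shared by both asynchronous start calls."""
    kwargs = {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        "ClientRequestToken": request_token(bucket, key, job_type),
        # Textract publishes the completion status to this SNS topic,
        # using the given service role.
        "NotificationChannel": {"SNSTopicArn": sns_topic_arn, "RoleArn": role_arn},
    }
    if job_type == "DocumentAnalysis":
        kwargs["FeatureTypes"] = ["TABLES", "FORMS"]
    return kwargs


def submit_jobs(client, bucket, key, sns_topic_arn, role_arn):
    """Start one job of each type and return the job IDs keyed by job type."""
    analysis = client.start_document_analysis(
        **start_job_kwargs(bucket, key, "DocumentAnalysis", sns_topic_arn, role_arn)
    )
    detection = client.start_document_text_detection(
        **start_job_kwargs(bucket, key, "TextDetection", sns_topic_arn, role_arn)
    )
    return {"DocumentAnalysis": analysis["JobId"], "TextDetection": detection["JobId"]}
```

Because the token depends only on the job type and the document location, submitting the same document twice produces the same token for each job type, while the two job types never share a token.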

License Summary

This sample code is made available under a modified MIT license. See the LICENSE file.

Prerequisites

In order to complete this workshop you'll need an AWS Account with access to create:

AWS IAM Roles
S3 buckets
S3 bucket policies
SNS topics
DynamoDB tables
Lambda functions
API Gateway endpoints and deployments

1. Launch stack

Textract Enhancer solution components can each be built by hand, using either the AWS Console or the AWS CLI. AWS CloudFormation, on the other hand, provides a mechanism to script the hard work of launching the whole stack.

You can use the button below to launch the solution stack. The component details are described in the following sections.

Region	Launch
US East (N. Virginia)

2. Architecture

The solution architecture is based solely on serverless Lambda functions that invoke Textract API endpoints. It uses Textract in asynchronous mode and a DynamoDB table to keep track of job status and response location.

The solution also uses a REST API, backed by another set of Lambda functions and the DynamoDB table, to provide fast querying of the resulting documents from the S3 bucket.
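In this asynchronous flow, a post-processing Lambda function first has to learn which job finished. The sketch below is an assumption about how such a handler might unpack the SNS event, and is not the repository's code. It relies on the Lambda SNS event shape (Records[].Sns.Message as a JSON string) and on the fields Textract normally includes in its completion message (JobId, Status, API, DocumentLocation). Check the field names against a real notification before relying on them.

```python
import json


def parse_textract_notifications(event):
    """Extract job details from an SNS-triggered Lambda event."""
    jobs = []
    for record in event.get("Records", []):
        # The SNS message body is delivered as a JSON-encoded string.
        message = json.loads(record["Sns"]["Message"])
        location = message.get("DocumentLocation", {})
        jobs.append({
            "JobId": message["JobId"],
            "Status": message["Status"],
            "API": message.get("API"),
            "DocumentBucket": location.get("S3Bucket"),
            "DocumentPath": location.get("S3ObjectName"),
        })
    return jobs


def succeeded_jobs(event):
    """Keep only the notifications for jobs that completed successfully."""
    return [j for j in parse_textract_notifications(event) if j["Status"] == "SUCCEEDED"]
```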

3. Solution components

3.1. DynamoDB Table

When a Textract job is submitted in asynchronous mode with a request token, a unique job ID is created. For any subsequent submission of the same document, the token prevents Textract from running the same job again. Because this solution submits two different types of jobs, one for DocumentAnalysis and one for TextDetection, a DynamoDB table with JobId as HASH key and JobType as RANGE key is used to track the status of each job.
To facilitate lookups by document location, the table also uses a global secondary index with DocumentBucket as HASH key and DocumentPath as RANGE key. The retrieval functions use this information later, when an API request is sent to obtain the tables, forms and lines of text.
Upon completion of a job, the post-processing Lambda functions update the corresponding records in this DynamoDB table with the location of the extracted files, as stored in the S3 bucket, and other metadata such as completion time and the number of pages, lines, tables and form fields.

The following snippet shows the schema definition used for the table:

"AttributeDefinitions": [
    {
        "AttributeName": "JobId",
        "AttributeType": "S"
    },       
    {
        "AttributeName": "JobType",
        "AttributeType": "S"
    },                                
    {
        "AttributeName": "DocumentBucket",
        "AttributeType": "S"
    },
    {
        "AttributeName": "DocumentPath",
        "AttributeType": "S"
    }                    
],
"KeySchema": [
    {
        "AttributeName": "JobId",
        "KeyType": "HASH"
    },
    {
        "AttributeName": "JobType",
        "KeyType": "RANGE"
    }                    
],
"GlobalSecondaryIndexes": [
    {
        "IndexName": "DocumentIndex",
        "KeySchema": [
            {
                "AttributeName": "DocumentBucket",
                "KeyType": "HASH"
            },
            {
                "AttributeName": "DocumentPath",
                "KeyType": "RANGE"
            }
        ],
        "Projection": {
            "ProjectionType": "KEYS_ONLY"
        },
        "ProvisionedThroughput": {
            "ReadCapacityUnits": 5,
            "WriteCapacityUnits": 5
        }
    }
],
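Given this schema, a retrieval function could look up jobs by document location roughly as sketched below. Since the DocumentIndex projection is KEYS_ONLY, a query on the index returns only the key attributes (JobId, JobType, DocumentBucket, DocumentPath), so the sketch follows up with get_item on the base table to read the full record. The function names and the table name passed in are hypothetical; the calls are the low-level Boto3 DynamoDB client's query and get_item, and pagination of query results is left out for brevity.

```python
def document_query_kwargs(table_name, bucket, path):
    """Build a query against the DocumentIndex global secondary index."""
    return {
        "TableName": table_name,
        "IndexName": "DocumentIndex",
        "KeyConditionExpression": "DocumentBucket = :b AND DocumentPath = :p",
        "ExpressionAttributeValues": {":b": {"S": bucket}, ":p": {"S": path}},
    }


def find_jobs(client, table_name, bucket, path):
    """Return the full tracking records for every job run on a document."""
    index_items = client.query(**document_query_kwargs(table_name, bucket, path))["Items"]
    jobs = []
    for item in index_items:
        # The index only carries keys, so fetch the full record from the table.
        full = client.get_item(
            TableName=table_name,
            Key={"JobId": item["JobId"], "JobType": item["JobType"]},
        )
        if "Item" in full:
            jobs.append(full["Item"])
    return jobs
```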

3.2. Lambda execution role

The Lambda functions used in this solution prototype share a common execution role. Its trust policy allows the Lambda service to assume the role, to which the required policies are attached.

The following snippet shows the assume role policy document for the Lambda execution role:

"AssumeRolePolicyDocument": {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "lambda.amazonaws.com"
                ]
            },
            "Action": [
                "sts:AssumeRole"
            ]
        }                       
    ]
}

The basic execution policy allows the Lambda functions to publish events to CloudWatch Logs and to send trace segments to X-Ray.

The following snippet shows the basic execution role policy document:

{
    "PolicyName": "lambda_basic_execution_policy",
    "PolicyDocument": {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents"
                ],
                "Resource": "arn:aws:logs:*:*:*"
            },
            {
                "Effect": "Allow",
                "Action": [
                    "xray:PutTraceSegments"
                ],
                "Resource": "*"
            }                                
        ]
    }
}

The Textract access policy attached to this role allows the Lambda functions to execute Textract API calls.

The following snippet shows the Textract access policy document:

{
    "PolicyName": "textract_access_policy",
    "PolicyDocument": {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "textract:*",
                "Resource": "*"
            }                             
        ]
    }
}

The DynamoDB access policy attached to this role allows the Lambda functions to write records to and read records from the tracking table.

The following snippet shows the DynamoDB access policy document:

{
    "PolicyName": "dynamodb_access_policy",
    "PolicyDocument": {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "dynamodb:*",
                "Resource": "*"
            }                             
        ]
    }
}

An IAM access policy is also attached to this role. When the job submission Lambda function is invoked with a bucket name owned by another AWS account, it automatically creates an IAM policy and attaches it to its own role, thereby allowing access to documents stored in the provided bucket.

The following snippet shows the IAM access policy document:

{
    "PolicyName": "iam_access_policy",
    "PolicyDocument": {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "iam:*",
                "Resource": "*"
            }                             
        ]
    }
}
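The self-attachment step described above might look roughly like the following sketch. It is an assumption, not the repository's implementation: it grants s3:GetObject on the given bucket through an inline role policy using the Boto3 IAM call put_role_policy, and the policy naming scheme and function names are invented for illustration. Note that for a bucket in another account, the bucket owner must also grant access on their side, which this sketch does not cover.

```python
import json


def bucket_read_policy(bucket):
    """Build a policy document that allows reading objects from one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


def attach_bucket_policy(iam_client, role_name, bucket):
    """Attach the read policy to the role as an inline policy and return its name."""
    policy_name = f"textract_read_{bucket}"  # hypothetical naming scheme
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(bucket_read_policy(bucket)),
    )
    return policy_name
```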
