## Analyzing Invoices and Receipts Using Textract

Amazon Textract uses machine learning to understand the context of invoices and receipts and automatically extracts data such as the invoice or receipt date, invoice or receipt number, item prices, total amount, and payment terms. The invoices do not need to be in a specific format.

The following is a list of the **standard fields that AnalyzeExpense currently supports**:

- Vendor Name: VENDOR_NAME
- Total: TOTAL
- Receiver Address: RECEIVER_ADDRESS
- Invoice/Receipt Date: INVOICE_RECEIPT_DATE
- Invoice/Receipt ID: INVOICE_RECEIPT_ID
- Payment Terms: PAYMENT_TERMS
- Subtotal: SUBTOTAL
- Due Date: DUE_DATE
- Tax: TAX
- Invoice Tax Payer ID (SSN/ITIN or EIN): TAX_PAYER_ID
- Item Name: ITEM_NAME
- Item Price: PRICE
- Item Quantity: QUANTITY

If the invoice or receipt contains other information that you would like to extract, you can use Textract Query to ask natural-language questions and get the answers.
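
As a rough illustration of how these field types show up in a response, the sketch below indexes the `SummaryFields` of an AnalyzeExpense-style response by their `Type` text. The `sample_response` dictionary is a hand-written stand-in rather than real Textract output, and `index_summary_fields` is a hypothetical helper name used only for this example.

```python
# Standard field types from the list above (read from field["Type"]["Text"]).
STANDARD_FIELD_TYPES = {
    "VENDOR_NAME", "TOTAL", "RECEIVER_ADDRESS", "INVOICE_RECEIPT_DATE",
    "INVOICE_RECEIPT_ID", "PAYMENT_TERMS", "SUBTOTAL", "DUE_DATE", "TAX",
    "TAX_PAYER_ID", "ITEM_NAME", "PRICE", "QUANTITY",
}


def index_summary_fields(response):
    """Map each standard field type to the list of values detected for it."""
    indexed = {}
    for expense_doc in response.get("ExpenseDocuments", []):
        for field in expense_doc.get("SummaryFields", []):
            field_type = field.get("Type", {}).get("Text")
            value = field.get("ValueDetection", {}).get("Text")
            if field_type in STANDARD_FIELD_TYPES and value is not None:
                indexed.setdefault(field_type, []).append(value)
    return indexed


# Hand-written stand-in for an AnalyzeExpense response (assumption, not real output).
sample_response = {
    "ExpenseDocuments": [{
        "SummaryFields": [
            {"Type": {"Text": "VENDOR_NAME"}, "ValueDetection": {"Text": "Example Store"}},
            {"Type": {"Text": "TOTAL"}, "ValueDetection": {"Text": "$12.50"}},
            {"Type": {"Text": "OTHER"}, "ValueDetection": {"Text": "Thank you"}},
        ]
    }]
}

standard_values = index_summary_fields(sample_response)
print(standard_values)
```

Fields whose type is not in the standard list (such as the `OTHER` entry in the sample) are skipped here; those are the cases where Textract Query can help.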

In [None]:
!pip install boto3 --upgrade
!pip install awscli --upgrade
!pip install botocore --upgrade

In [None]:
!pip install amazon-textract-response-parser --upgrade

In [None]:
!pip install amazon-textract-prettyprinter --upgrade

In [None]:
import boto3
from PIL import Image
import json
import pandas as pd
import time
import requests
import urllib.parse as urlparse
import html

In [None]:
from textractprettyprinter.t_pretty_print_expense import get_string
from textractprettyprinter.t_pretty_print_expense import Textract_Expense_Pretty_Print, Pretty_Print_Table_Format

In [None]:
#-- Document
s3BucketName = "my-projects-abhi-2022"   # create a bucket and change to your bucket name
s3=boto3.resource('s3')
region = boto3.session.Session().region_name

#-- Amazon Textract client
textract = boto3.client('textract', region_name=region)

Let's copy some invoices and receipts to our S3 bucket. In the cell below, replace the destination bucket (my-projects-abhi-2022) with your own bucket name.

In [None]:
!aws s3 cp ./Data s3://my-projects-abhi-2022/Textract/invoices_processing_workshop/ --recursive

In [None]:
from PIL import Image
import s3fs

fs = s3fs.S3FileSystem()

documentName = "Textract/invoices_processing_workshop/invoice_0.png"  # change to S3 key of your document and remove the bucket name

# open it directly
with fs.open("s3://" + s3BucketName+"/" + documentName) as f:
    img=Image.open(f)
    basewidth = 1000
    wpercent = (basewidth/float(img.size[0]))
    hsize = int((float(img.size[1])*float(wpercent)))
    img = img.resize((basewidth,hsize), Image.BICUBIC)
    display(img)

### Using Amazon Textract AnalyzeExpense and Textract Query

In [None]:
%%time

#-- Call Amazon Textract AnalyzeExpense
response_expense = textract.analyze_expense(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

In [None]:
%store response_expense

In [None]:
pretty_printed_string = get_string(textract_json=response_expense, output_type=[Textract_Expense_Pretty_Print.SUMMARY, Textract_Expense_Pretty_Print.LINEITEMGROUPS], table_format=Pretty_Print_Table_Format.fancy_grid)
print(pretty_printed_string)

The code below prints the AnalyzeExpense API result as key:value pairs.

In [None]:
summary_entities_values = []
summary_fields = []
expense_item = []


def field_to_kv(field):
    """Return {label: value} for an expense field, or None if no value was detected."""
    value = field.get("ValueDetection", {}).get("Text")
    if value is None:
        return None
    # Prefer the label printed on the document; fall back to the normalized field type
    if "LabelDetection" in field:
        key = field["LabelDetection"]["Text"]
    else:
        key = field["Type"]["Text"]
    return {key: value}


for expense_doc in response_expense["ExpenseDocuments"]:
    # Summary fields (vendor, total, dates, ...)
    for field in expense_doc["SummaryFields"]:
        kvs = field_to_kv(field)
        if kvs is not None:
            summary_entities_values.append(kvs)

    # Line-item fields (item, quantity, price, ...)
    for line_item_group in expense_doc["LineItemGroups"]:
        for line_items in line_item_group["LineItems"]:
            for field in line_items["LineItemExpenseFields"]:
                kvs = field_to_kv(field)
                if kvs is not None:
                    expense_item.append(kvs)

print("Summary Items:\n")
print(*summary_entities_values, sep='\n')
print("\nExpense Items:\n")
print(*expense_item, sep='\n')
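
The cell above puts every line-item field into one flat list, so it is hard to tell which fields belong to the same item. A minimal sketch of one alternative, assuming the same response layout, builds one dictionary per line item keyed by field type. `line_items_to_rows` is a hypothetical helper, and the sample data (including the type names in it) is hand-written for illustration.

```python
def line_items_to_rows(response):
    """Return one dict per line item, keyed by the field type text."""
    rows = []
    for expense_doc in response.get("ExpenseDocuments", []):
        for group in expense_doc.get("LineItemGroups", []):
            for line_item in group.get("LineItems", []):
                row = {}
                for field in line_item.get("LineItemExpenseFields", []):
                    key = field.get("Type", {}).get("Text", "UNKNOWN")
                    row[key] = field.get("ValueDetection", {}).get("Text", "")
                rows.append(row)
    return rows


# Hand-written stand-in for an AnalyzeExpense response (assumption, not real output).
sample_response = {
    "ExpenseDocuments": [{
        "LineItemGroups": [{
            "LineItems": [
                {"LineItemExpenseFields": [
                    {"Type": {"Text": "ITEM"}, "ValueDetection": {"Text": "Coffee"}},
                    {"Type": {"Text": "PRICE"}, "ValueDetection": {"Text": "3.50"}},
                ]},
                {"LineItemExpenseFields": [
                    {"Type": {"Text": "ITEM"}, "ValueDetection": {"Text": "Bagel"}},
                    {"Type": {"Text": "PRICE"}, "ValueDetection": {"Text": "2.25"}},
                ]},
            ]
        }]
    }]
}

rows = line_items_to_rows(sample_response)
for row in rows:
    print(row)
```

Because each row is a plain dictionary, a list like this can be passed to `pd.DataFrame(rows)` to get one table row per line item.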

In [None]:
response = textract.analyze_document(
            Document={'S3Object': {'Bucket': s3BucketName, 'Name': documentName}},
            FeatureTypes=["QUERIES"],
            QueriesConfig={
                    'Queries': [
                        {
                            'Text': 'what is the merchant address',
                            'Alias': 'merchant_address'
                        },
                    ]
                }
            )

In [None]:
%store response

In [None]:
#print(json.dumps(response, indent=4))

In [None]:
import trp.trp2 as t2
d = t2.TDocumentSchema().load(response)
page = d.pages[0]
query_answers = d.get_query_answers(page=page)
for item in query_answers:
    print(item[1],": ", item[2])
    if item[1]=='merchant_address':
        merchant_address = item[2]

In [None]:
merchant_address
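
The response parser hides how answers are linked to queries. As a sketch of what happens underneath, assuming each `QUERY` block points to its `QUERY_RESULT` block through an `ANSWER` relationship, the answers can also be read straight from the `Blocks` list. `get_query_answers_from_blocks` is a hypothetical helper, and `sample_query_response` is a hand-written stand-in for a real response.

```python
def get_query_answers_from_blocks(response):
    """Return {alias_or_question: answer_text} from an AnalyzeDocument response."""
    blocks = response.get("Blocks", [])
    blocks_by_id = {block["Id"]: block for block in blocks}
    answers = {}
    for block in blocks:
        if block.get("BlockType") != "QUERY":
            continue
        query = block.get("Query", {})
        name = query.get("Alias") or query.get("Text")
        # A query without an answer has no ANSWER relationship at all.
        for relationship in block.get("Relationships") or []:
            if relationship.get("Type") != "ANSWER":
                continue
            for answer_id in relationship.get("Ids", []):
                answer_block = blocks_by_id.get(answer_id)
                if answer_block is not None:
                    answers[name] = answer_block.get("Text", "")
    return answers


# Hand-written stand-in for an AnalyzeDocument response (assumption, not real output).
sample_query_response = {
    "Blocks": [
        {"Id": "q1", "BlockType": "QUERY",
         "Query": {"Text": "what is the merchant address", "Alias": "merchant_address"},
         "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}]},
        {"Id": "a1", "BlockType": "QUERY_RESULT",
         "Text": "123 Example St Seattle WA 98101", "Confidence": 97.0},
        {"Id": "q2", "BlockType": "QUERY",
         "Query": {"Text": "what is the date", "Alias": "Date"}},
    ]
}

answers = get_query_answers_from_blocks(sample_query_response)
print(answers)
```

Looking answers up with `answers.get("merchant_address")` returns `None` when Textract found no answer, which avoids relying on a variable that is only set inside a loop.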

## Analyzing a Receipt

In [None]:
from PIL import Image
import s3fs

fs = s3fs.S3FileSystem()

s3BucketName = "my-projects-abhi-2022"   # change to your bucket name
documentName = "Textract/invoices_processing_workshop/receipt_0.png"

# open it directly
with fs.open("s3://" + s3BucketName + "/" + documentName) as f:
    img=Image.open(f)
    basewidth = 1000
    wpercent = (basewidth/float(img.size[0]))
    hsize = int((float(img.size[1])*float(wpercent)))
    img = img.resize((basewidth,hsize), Image.BICUBIC)
    display(img)

In [None]:
#-- Call Amazon Textract AnalyzeExpense
response_receipt = textract.analyze_expense(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

In [None]:
%store response_receipt

In [None]:
pretty_printed_string = get_string(textract_json=response_receipt, output_type=[Textract_Expense_Pretty_Print.SUMMARY, Textract_Expense_Pretty_Print.LINEITEMGROUPS], table_format=Pretty_Print_Table_Format.fancy_grid)
print(pretty_printed_string)

Now, we will use the **Textract Query** feature to extract some additional information, such as the date and the merchant address.

In [None]:
response_query_receipt = textract.analyze_document(
            Document={'S3Object': {'Bucket': s3BucketName, 'Name': documentName}},
            FeatureTypes=["QUERIES"],
            QueriesConfig={
                    'Queries': [
                        {
                            'Text': 'what is the date',
                            'Alias': 'Date'
                        },
                        {
                            'Text': 'what is the merchant address',
                            'Alias': 'merchant_address'
                        }
                    ]
                }
            )

In [None]:
import trp.trp2 as t2
d = t2.TDocumentSchema().load(response_query_receipt)
page = d.pages[0]
query_answers = d.get_query_answers(page=page)
for item in query_answers:
    print(item[1],": ", item[2])
    if item[1]=='merchant_address':
        merchant_address = item[2]

In [None]:
# Keep only the last two tokens of the address (for example, state and ZIP code) for the place lookup
address = " ".join(merchant_address.split()[-2:])
address

### Extracting Currency Using Amazon Location Service

In [None]:
import sagemaker as sm
role = sm.get_execution_role()
role

**Before you can use Amazon Location Service, make sure that you have added geo search access to the SageMaker execution role.**
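
As a hedged sketch of what "geo search access" could look like, the snippet below builds an IAM policy document that allows the `geo:SearchPlaceIndexForText` action on one place index. The region, account ID, and index name are placeholders, `build_geo_search_policy` is a hypothetical helper, and the exact permissions your role needs may differ.

```python
import json


def build_geo_search_policy(region, account_id, index_name):
    """Build a policy document allowing text search on a single place index."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "geo:SearchPlaceIndexForText",
            "Resource": f"arn:aws:geo:{region}:{account_id}:place-index/{index_name}",
        }],
    }


# Placeholder values (assumptions): replace with your own region, account, and index.
policy = build_geo_search_policy("us-east-1", "111122223333", "ExamplePlaceIndex")
print(json.dumps(policy, indent=2))
```

The printed JSON can be attached to the SageMaker execution role as an inline policy from the IAM console.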

In [None]:
# Using Amazon Location Service.
# 'ExamplePlaceIndex' must be the name of a place index that already exists in your account.
client = boto3.client('location')
country_api = client.search_place_index_for_text(IndexName='ExamplePlaceIndex', MaxResults=1, Text=address)
if country_api['Results'] and country_api['Results'][0]['Relevance'] > 0.8:
    country=country_api['Results'][0]['Place']['Country']
    print("country from Amazon location API:", country)
    if country =='USA':
        currency = 'USD'
        print("Currency: ",currency)
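
The cell above only handles the `USA` case. A minimal sketch of a more general lookup, assuming the place result carries a three-letter country code as in the cell above, maps the code to a currency through a small table. `COUNTRY_TO_CURRENCY` is deliberately incomplete, `currency_for_place_result` is a hypothetical helper, and `sample_place_response` is a hand-written stand-in for a real search response.

```python
# Small illustrative table of ISO 3166 alpha-3 country codes to ISO 4217 currency codes.
COUNTRY_TO_CURRENCY = {
    "USA": "USD",
    "CAN": "CAD",
    "GBR": "GBP",
    "SGP": "SGD",
    "JPN": "JPY",
    "IND": "INR",
    "AUS": "AUD",
}


def currency_for_place_result(search_response, min_relevance=0.8):
    """Return the currency for the top place result, or None if it is unknown or weak."""
    results = search_response.get("Results", [])
    if not results:
        return None
    best = results[0]
    if best.get("Relevance", 0.0) <= min_relevance:
        return None
    country = best.get("Place", {}).get("Country")
    return COUNTRY_TO_CURRENCY.get(country)


# Hand-written stand-in for a place search response (assumption, not real output).
sample_place_response = {
    "Results": [{"Relevance": 0.95, "Place": {"Country": "SGP", "Label": "Singapore"}}]
}

currency = currency_for_place_result(sample_place_response)
print("Currency:", currency)
```

Returning `None` for a missing, low-relevance, or unmapped result lets the caller decide on a fallback instead of leaving `currency` undefined.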

### Let's try out another receipt

In [None]:
from PIL import Image
import s3fs

fs = s3fs.S3FileSystem()

s3BucketName = "my-projects-abhi-2022"   # change to your bucket name
documentName = "Textract/invoices_processing_workshop/receipt_02.png"

# open it directly
with fs.open("s3://" + s3BucketName + "/" + documentName) as f:
    img=Image.open(f)
    basewidth = 1000
    wpercent = (basewidth/float(img.size[0]))
    hsize = int((float(img.size[1])*float(wpercent)))
    img = img.resize((basewidth,hsize), Image.BICUBIC)
    display(img)

In [None]:
#-- Call Amazon Textract AnalyzeExpense
response_receipt_2 = textract.analyze_expense(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

In [None]:
%store response_receipt_2

In [None]:
pretty_printed_string = get_string(textract_json=response_receipt_2, output_type=[Textract_Expense_Pretty_Print.SUMMARY, Textract_Expense_Pretty_Print.LINEITEMGROUPS], table_format=Pretty_Print_Table_Format.fancy_grid)
print(pretty_printed_string)

In [None]:
response_query_receipt_2 = textract.analyze_document(
            Document={'S3Object': {'Bucket': s3BucketName, 'Name': documentName}},
            FeatureTypes=["QUERIES"],
            QueriesConfig={
                    'Queries': [
                        {
                            'Text': 'what is the date',
                            'Alias': 'Date'
                        },
                        {
                            'Text': 'what is the merchant address',
                            'Alias': 'merchant_address'
                        }
                    ]
                }
            )

In [None]:
import trp.trp2 as t2
d = t2.TDocumentSchema().load(response_query_receipt_2)
page = d.pages[0]
query_answers = d.get_query_answers(page=page)
for item in query_answers:
    print(item[1],": ", item[2])
    if item[1]=='merchant_address':
        merchant_address = item[2]

## Large-scale document processing with Amazon Textract

This reference architecture shows how you can extract text and data from documents at scale using Amazon Textract. Below are some of the key attributes of the reference architecture:

- Process incoming documents to an Amazon S3 bucket.
- Process a large backfill of existing documents in an Amazon S3 bucket.
- Serverless, highly available and highly scalable architecture.
- Easily handle spiky workloads.
- Pipelines to support both Sync and Async APIs of Amazon Textract.
- Control the rate at which you process documents without doing any complex distributed job management. This control can be important to protect your downstream systems, which will be ingesting output from Textract.
- Sample implementation which takes advantage of AWS Cloud Development Kit (CDK) to define infrastructure in code and provision it through CloudFormation.

![image.png](attachment:image.png)

### Image pipeline (Use Sync APIs of Amazon Textract)

- The process starts when a message is sent to an Amazon SQS queue to analyze a document.
- A Lambda function is invoked synchronously with an event that contains the queue message.
- The Lambda function then calls Amazon Textract and stores the result in different datastores, for example DynamoDB, S3, or Elasticsearch.
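
The three steps above can be sketched as a small function that a Lambda handler might call. This is not the code from the sample repository: the message keys `bucketName` and `objectName`, the `save_result` callback, and the `FakeTextract` class are assumptions made so the example runs without AWS access.

```python
import json


def process_sync_records(event, textract_client, save_result):
    """Analyze each document named in the SQS records and hand the result to save_result."""
    processed = 0
    for record in event.get("Records", []):
        message = json.loads(record["body"])  # SQS delivers the message text in "body"
        response = textract_client.analyze_document(
            Document={"S3Object": {"Bucket": message["bucketName"],
                                   "Name": message["objectName"]}},
            FeatureTypes=["TABLES", "FORMS"],
        )
        save_result(message["objectName"], response)
        processed += 1
    return processed


class FakeTextract:
    """Stand-in for boto3.client('textract') so the sketch runs offline."""

    def analyze_document(self, Document, FeatureTypes):
        return {"DocumentMetadata": {"Pages": 1}, "Blocks": []}


# Hand-written SQS-style event (assumption about the message format).
sample_event = {"Records": [
    {"body": json.dumps({"bucketName": "example-bucket", "objectName": "invoice_0.png"})}
]}

saved_results = {}
processed_count = process_sync_records(sample_event, FakeTextract(), saved_results.__setitem__)
print(processed_count, list(saved_results))
```

In a real handler, `textract_client` would be `boto3.client('textract')` and `save_result` would write to DynamoDB, S3, or another datastore.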

### Image and PDF pipeline (Use Async APIs of Amazon Textract)

- The process starts when a message is sent to an SQS queue to analyze a document.
- A job scheduler Lambda function runs at a certain frequency, for example every 5 minutes, and polls for messages in the SQS queue.
- For each message in the queue, it submits an Amazon Textract job to process the document and continues submitting these jobs until it reaches the maximum limit of concurrent jobs in your AWS account.
- When Amazon Textract finishes processing a document, it sends a completion notification to an SNS topic.
- SNS then triggers the job scheduler Lambda function to start the next set of Amazon Textract jobs.
- SNS also sends a message to an SQS queue, which is then processed by a Lambda function to get results from Amazon Textract and store them in a relevant datastore, for example DynamoDB, S3, or Elasticsearch.

Your pipeline runs at maximum throughput based on the limits on your account. If needed, you can get limits raised for concurrent jobs, and the pipeline automatically adapts based on the new limits.
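
One way to picture the job scheduler step is a function that starts jobs only while free job slots remain. The sketch below is an assumption about how such a scheduler could work, not the sample's actual implementation (which may detect the limit differently). The message keys, ARNs, `submit_pending_jobs`, and `FakeAsyncTextract` are all hypothetical.

```python
def submit_pending_jobs(pending_documents, running_jobs, max_concurrent_jobs,
                        textract_client, sns_topic_arn, role_arn):
    """Start async Textract jobs for pending documents until the job limit is reached."""
    job_ids = []
    free_slots = max(0, max_concurrent_jobs - running_jobs)
    for document in pending_documents[:free_slots]:
        response = textract_client.start_document_analysis(
            DocumentLocation={"S3Object": {"Bucket": document["bucketName"],
                                           "Name": document["objectName"]}},
            FeatureTypes=["TABLES", "FORMS"],
            # Textract publishes the completion notification to this SNS topic.
            NotificationChannel={"SNSTopicArn": sns_topic_arn, "RoleArn": role_arn},
        )
        job_ids.append(response["JobId"])
    return job_ids


class FakeAsyncTextract:
    """Stand-in for boto3.client('textract') so the sketch runs offline."""

    def __init__(self):
        self.started = 0

    def start_document_analysis(self, DocumentLocation, FeatureTypes, NotificationChannel):
        self.started += 1
        return {"JobId": f"job-{self.started}"}


pending = [{"bucketName": "example-bucket", "objectName": f"doc_{i}.pdf"} for i in range(5)]
job_ids = submit_pending_jobs(
    pending, running_jobs=1, max_concurrent_jobs=3,
    textract_client=FakeAsyncTextract(),
    sns_topic_arn="arn:aws:sns:us-east-1:111122223333:example-topic",   # placeholder
    role_arn="arn:aws:iam::111122223333:role/example-textract-role",    # placeholder
)
print(job_ids)
```

Documents that do not fit into the free slots stay in the queue and are picked up on the next scheduler run or the next completion notification.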


### Steps to implement this architecture

We will use AWS Cloud9 to deploy this architecture. Open Cloud9 from the AWS console, select a "t2.large" EC2 instance, and create the environment.

- Once the environment is created, open the corresponding IDE.
- In the terminal, check whether Node.js is installed by running `node --version`.
- If Node.js is not installed, install nvm (Node Version Manager) with `curl -o- https://raw.githubusercontent.com/creationix/nvm/v0.33.0/install.sh | bash`, then install Node.js with `nvm install node`.
- Next, install or upgrade the AWS CLI with `pip install awscli --upgrade`.
- Now clone the repo with `git clone https://github.com/aws-samples/amazon-textract-serverless-large-scale-document-processing.git`.
- In the terminal, go to the "textract-pipeline" directory.
- Run `cdk bootstrap`.
- Run `cdk deploy`.

### Test incoming documents

- Go to the Amazon S3 bucket "textractpipeline-documentsbucketxxxx" created by the stack and upload a few sample documents (jpg/jpeg, png, pdf).
- You will see output files generated for each document in a folder named "{filename}-analysis" (refresh the Amazon S3 bucket to see these results).

### Source code

- `s3batchproc.py`: Lambda function that handles events from an S3 Batch Operations job.
- `s3proc.py`: Lambda function that handles the S3 event for an object creation.
- `docproc.py`: Lambda function that pushes documents to queues for the sync or async pipelines.
- `syncproc.py`: Lambda function that takes documents from a queue and processes them using the sync APIs.
- `asyncproc.py`: Lambda function that takes documents from a queue and starts async Amazon Textract jobs.
- `jobresultsproc.py`: Lambda function that processes results for a completed Amazon Textract async job.
- `textract-pipeline-stack.ts`: CDK code that defines the infrastructure, including IAM roles, Lambda functions, and SQS queues.
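
To give a rough idea of what processing the results of a completed async job involves, the sketch below pages through `get_document_analysis` with `NextToken` and collects every block. It is not the code in `jobresultsproc.py`; `collect_job_blocks` and `FakePagedTextract` are hypothetical names, and the fake client exists only so the example runs without AWS access.

```python
def collect_job_blocks(textract_client, job_id):
    """Fetch every page of results for an async job and return all blocks."""
    blocks = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        page = textract_client.get_document_analysis(**kwargs)
        blocks.extend(page.get("Blocks", []))
        next_token = page.get("NextToken")
        if not next_token:
            return blocks


class FakePagedTextract:
    """Stand-in client that returns two pages of results."""

    def __init__(self):
        self.pages = {
            None: {"JobStatus": "SUCCEEDED",
                   "Blocks": [{"Id": "1"}, {"Id": "2"}],
                   "NextToken": "page-2"},
            "page-2": {"JobStatus": "SUCCEEDED", "Blocks": [{"Id": "3"}]},
        }

    def get_document_analysis(self, JobId, NextToken=None):
        return self.pages[NextToken]


all_blocks = collect_job_blocks(FakePagedTextract(), "example-job-id")
print(len(all_blocks), "blocks collected")
```

A production version would also check that `JobStatus` is `SUCCEEDED` before storing the blocks.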


## Textract Quotas

![image.png](attachment:image.png)

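For **transactions per second (TPS), the default quota is 1 for the Singapore region**, but you can request an increase by opening a ticket from the Service Quotas page of the AWS console (https://ap-northeast-1.console.aws.amazon.com/servicequotas/home?region=ap-northeast-1). This link opens the console in ap-northeast-1, so switch to the region whose quota you want to raise before submitting the request.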
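
With a low TPS quota, calls can be throttled. A minimal sketch of one common way to cope, assuming throttled calls raise an exception that carries an error code in `exc.response["Error"]["Code"]` (as botocore's `ClientError` does), is to retry with exponential backoff. `call_with_backoff`, `FakeThrottlingError`, and `flaky_operation` are hypothetical names used only for this example.

```python
import random
import time

# Error codes treated as throttling in this sketch.
THROTTLING_ERROR_CODES = {"ProvisionedThroughputExceededException", "ThrottlingException"}


def error_code(exc):
    """Read the service error code from an exception, if it has one."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code")


def call_with_backoff(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation(), retrying throttling errors with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if error_code(exc) not in THROTTLING_ERROR_CODES or attempt == max_attempts:
                raise
            # Delay doubles on each attempt, plus a random jitter of up to base_delay.
            sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, base_delay))


class FakeThrottlingError(Exception):
    """Stand-in for a throttled service call."""

    response = {"Error": {"Code": "ThrottlingException"}}


attempts = {"count": 0}


def flaky_operation():
    # Fails twice with a throttling error, then succeeds.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise FakeThrottlingError()
    return "ok"


recorded_delays = []
result = call_with_backoff(flaky_operation, sleep=recorded_delays.append)
print(result, "after", attempts["count"], "attempts")
```

In the notebook, a call could be wrapped as `call_with_backoff(lambda: textract.analyze_expense(Document={'S3Object': {'Bucket': s3BucketName, 'Name': documentName}}))`.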