## Analyzing Invoices and Receipts Using Amazon Textract

Amazon Textract uses machine learning to understand the context of invoices and receipts and automatically extracts data such as the invoice or receipt date, invoice or receipt number, item prices, total amount, and payment terms. The documents do not need to follow a specific format.

The following is a list of the **standard fields that AnalyzeExpense currently supports**:

- Vendor Name: VENDOR_NAME
- Total: TOTAL
- Receiver Address: RECEIVER_ADDRESS
- Invoice/Receipt Date: INVOICE_RECEIPT_DATE
- Invoice/Receipt ID: INVOICE_RECEIPT_ID
- Payment Terms: PAYMENT_TERMS
- Subtotal: SUBTOTAL
- Due Date: DUE_DATE
- Tax: TAX
- Invoice Tax Payer ID (SSN/ITIN or EIN): TAX_PAYER_ID
- Item Name: ITEM_NAME
- Item Price: PRICE
- Item Quantity: QUANTITY

If an invoice or receipt contains other information that you would like to extract, you can use Textract Query to ask natural-language questions and get the answers.
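
As a rough illustration of how the standard field types show up in a response, the sketch below collects summary fields by their `Type` text. The `sample_response` dictionary is hand-written to resemble the shape of an AnalyzeExpense response (it is not real Textract output), and `collect_standard_fields` is a hypothetical helper name used only here.

```python
# Summary-level type codes taken from the list above (line-item codes are left out).
STANDARD_SUMMARY_TYPES = {
    "VENDOR_NAME", "TOTAL", "RECEIVER_ADDRESS", "INVOICE_RECEIPT_DATE",
    "INVOICE_RECEIPT_ID", "PAYMENT_TERMS", "SUBTOTAL", "DUE_DATE",
    "TAX", "TAX_PAYER_ID",
}


def collect_standard_fields(response):
    """Return {type: [values]} for summary fields whose type is a standard one."""
    found = {}
    for doc in response.get("ExpenseDocuments", []):
        for field in doc.get("SummaryFields", []):
            field_type = field.get("Type", {}).get("Text")
            value = field.get("ValueDetection", {}).get("Text")
            if field_type in STANDARD_SUMMARY_TYPES and value is not None:
                found.setdefault(field_type, []).append(value)
    return found


# Hand-written sample shaped like an AnalyzeExpense response (assumption, not real output).
sample_response = {
    "ExpenseDocuments": [{
        "SummaryFields": [
            {"Type": {"Text": "VENDOR_NAME"},
             "ValueDetection": {"Text": "Example Store"}},
            {"Type": {"Text": "TOTAL"}, "LabelDetection": {"Text": "Total"},
             "ValueDetection": {"Text": "$12.50"}},
            {"Type": {"Text": "OTHER"}, "LabelDetection": {"Text": "Cashier"},
             "ValueDetection": {"Text": "Sam"}},
        ],
        "LineItemGroups": [],
    }]
}

standard_fields = collect_standard_fields(sample_response)
print(standard_fields)
```

Fields whose type is not in the standard list, such as the cashier name in the sample, are the kind of information you would typically ask for with a query instead.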

In [None]:
!pip install boto3 --upgrade
!pip install awscli --upgrade
!pip install botocore --upgrade

In [None]:
!pip install amazon-textract-response-parser --upgrade

In [None]:
!pip install amazon-textract-prettyprinter --upgrade

In [None]:
import boto3
from PIL import Image
import json
import pandas as pd
import time
import requests
import urllib.parse as urlparse
import html

In [None]:
from textractprettyprinter.t_pretty_print_expense import get_string
from textractprettyprinter.t_pretty_print_expense import Textract_Expense_Pretty_Print, Pretty_Print_Table_Format

In [None]:
#-- Document
s3BucketName = "my-projects-abhi-2022"   # create a bucket and change to your bucket name
s3=boto3.resource('s3')
region = boto3.session.Session().region_name

#-- Amazon Textract client
textract = boto3.client('textract', region_name=region)

Let's copy some invoices and receipts to our S3 bucket. In the cell below, replace the destination bucket (my-projects-abhi-2022) with your own bucket name.

In [None]:
!aws s3 cp s3://ml-materials/invoices_processing_workshop/ s3://my-projects-abhi-2022/Textract/invoices_processing_workshop/ --recursive

In [None]:
from PIL import Image
import s3fs

fs = s3fs.S3FileSystem()

documentName = "Textract/invoices_processing_workshop/invoice_0.png"  # change to S3 key of your document and remove the bucket name

# open it directly
with fs.open("s3://" + s3BucketName+"/" + documentName) as f:
    img=Image.open(f)
    basewidth = 1000
    wpercent = (basewidth/float(img.size[0]))
    hsize = int((float(img.size[1])*float(wpercent)))
    img = img.resize((basewidth,hsize), Image.BICUBIC)
    display(img)
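
The resize arithmetic in the cell above is repeated for every document in this notebook. A minimal sketch of the same calculation as a reusable function is shown below; `scaled_size` is a hypothetical helper name, and the example image size is made up.

```python
def scaled_size(size, target_width=1000):
    """Return (width, height) scaled to target_width while keeping the aspect ratio."""
    width, height = size
    if width <= 0:
        raise ValueError("image width must be positive")
    scale = target_width / float(width)
    return target_width, int(height * scale)


# Example with a made-up 2000x3000 image size
print(scaled_size((2000, 3000)))
```

With a PIL image, the result could be passed straight to the resize call, for example `img.resize(scaled_size(img.size), Image.BICUBIC)`.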

### Using Amazon Textract AnalyzeExpense and Textract Query

In [None]:
%%time

#-- Call Amazon Textract AnalyzeExpense
response_expense = textract.analyze_expense(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

In [None]:
%store response_expense

In [None]:
pretty_printed_string = get_string(textract_json=response_expense, output_type=[Textract_Expense_Pretty_Print.SUMMARY, Textract_Expense_Pretty_Print.LINEITEMGROUPS], table_format=Pretty_Print_Table_Format.fancy_grid)
print(pretty_printed_string)

The code below prints the AnalyzeExpense API result as key-value pairs.

In [None]:
summary_entities_values = []
expense_item = []


def field_to_kv(field):
    """Return a one-entry {key: value} dict for a field, or None if it has no value."""
    if "ValueDetection" not in field:
        return None
    # Prefer the label printed on the document; otherwise use the normalized type
    if "LabelDetection" in field:
        key = field["LabelDetection"]["Text"]
    else:
        key = field["Type"]["Text"]
    return {key: field["ValueDetection"]["Text"]}


for expense_doc in response_expense["ExpenseDocuments"]:
    # Document-level fields such as vendor name, total, and dates
    for field in expense_doc["SummaryFields"]:
        kv = field_to_kv(field)
        if kv:
            summary_entities_values.append(kv)

    # Fields that belong to individual line items
    for line_item_group in expense_doc["LineItemGroups"]:
        for line_item in line_item_group["LineItems"]:
            for field in line_item["LineItemExpenseFields"]:
                kv = field_to_kv(field)
                if kv:
                    expense_item.append(kv)

print("Summary Items:\n")
print(*summary_entities_values, sep='\n')
print("\nExpense Items:\n")
print(*expense_item, sep='\n')
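
The cell above prints every line-item field as its own key-value pair, so it is not obvious which fields belong to the same row. As one possible alternative, the sketch below builds one dictionary per line item, keyed by the field type. `line_items_as_rows` is a hypothetical helper, and `sample_line_item_response` is a hand-written stand-in that uses the type codes from the list at the top of this section.

```python
def line_items_as_rows(response):
    """Return one {type: value} dict per line item in an AnalyzeExpense-shaped response."""
    rows = []
    for doc in response.get("ExpenseDocuments", []):
        for group in doc.get("LineItemGroups", []):
            for line_item in group.get("LineItems", []):
                row = {}
                for field in line_item.get("LineItemExpenseFields", []):
                    field_type = field.get("Type", {}).get("Text")
                    value = field.get("ValueDetection", {}).get("Text")
                    if field_type and value is not None:
                        row[field_type] = value
                if row:
                    rows.append(row)
    return rows


def _field(field_type, value):
    # Small builder for the hand-written sample below
    return {"Type": {"Text": field_type}, "ValueDetection": {"Text": value}}


# Hand-written sample (assumption, not real Textract output)
sample_line_item_response = {
    "ExpenseDocuments": [{
        "SummaryFields": [],
        "LineItemGroups": [{
            "LineItems": [
                {"LineItemExpenseFields": [
                    _field("ITEM_NAME", "Coffee"),
                    _field("PRICE", "3.00"),
                    _field("QUANTITY", "1"),
                ]},
                {"LineItemExpenseFields": [
                    _field("ITEM_NAME", "Bagel"),
                    _field("PRICE", "2.50"),
                ]},
            ]
        }],
    }]
}

rows = line_items_as_rows(sample_line_item_response)
print(*rows, sep="\n")
```

Because each row is a plain dictionary, the list can also be handed to `pd.DataFrame(rows)` to view the line items as a table.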

In [None]:
response = textract.analyze_document(
            Document={'S3Object': {'Bucket': s3BucketName, 'Name': documentName}},
            FeatureTypes=["QUERIES"],
            QueriesConfig={
                    'Queries': [
                        {
                            'Text': 'what is the merchant address',
                            'Alias': 'merchant_address'
                        },
                    ]
                }
            )

In [None]:
%store response

In [None]:
#print(json.dumps(response, indent=4))

In [None]:
import trp.trp2 as t2
d = t2.TDocumentSchema().load(response)
page = d.pages[0]
query_answers = d.get_query_answers(page=page)
for item in query_answers:
    print(item[1],": ", item[2])
    if item[1]=='merchant_address':
        merchant_address = item[2]

In [None]:
merchant_address
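
For reference, the query answers can also be read from the raw response without the response parser. The sketch below assumes the usual block layout for query results: a `QUERY` block that points to its `QUERY_RESULT` blocks through an `ANSWER` relationship. `query_answers_from_blocks` is a hypothetical helper, and `sample_query_response` is a hand-written stand-in rather than real output.

```python
def query_answers_from_blocks(response):
    """Map each query alias (or query text) to the text of its first answer, if any."""
    blocks = response.get("Blocks", [])
    by_id = {block["Id"]: block for block in blocks}
    answers = {}
    for block in blocks:
        if block.get("BlockType") != "QUERY":
            continue
        query = block.get("Query", {})
        key = query.get("Alias") or query.get("Text")
        answers[key] = None  # stays None when no answer block is linked
        for rel in block.get("Relationships", []):
            if rel.get("Type") != "ANSWER":
                continue
            for answer_id in rel.get("Ids", []):
                result = by_id.get(answer_id, {})
                if result.get("BlockType") == "QUERY_RESULT" and answers[key] is None:
                    answers[key] = result.get("Text")
    return answers


# Hand-written sample (assumption, not real Textract output)
sample_query_response = {
    "Blocks": [
        {"Id": "q1", "BlockType": "QUERY",
         "Query": {"Text": "what is the merchant address", "Alias": "merchant_address"},
         "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}]},
        {"Id": "a1", "BlockType": "QUERY_RESULT",
         "Text": "123 Any Street, Any Town, WA 98101", "Confidence": 95.0},
        {"Id": "q2", "BlockType": "QUERY",
         "Query": {"Text": "what is the date", "Alias": "Date"}},
    ]
}

parsed_answers = query_answers_from_blocks(sample_query_response)
print(parsed_answers)
```

Returning `None` for unanswered queries makes it easier to notice when a question did not match anything on the page, instead of failing later on an undefined variable.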

## Analyzing a Receipt

In [None]:
from PIL import Image
import s3fs

fs = s3fs.S3FileSystem()

s3BucketName = "my-projects-abhi-2022"
documentName = "Textract/invoices_processing_workshop/receipt_0.png"

# open it directly
with fs.open("s3://" + s3BucketName + "/" + documentName) as f:
    img=Image.open(f)
    basewidth = 1000
    wpercent = (basewidth/float(img.size[0]))
    hsize = int((float(img.size[1])*float(wpercent)))
    img = img.resize((basewidth,hsize), Image.BICUBIC)
    display(img)

In [None]:
#-- Call Amazon Textract AnalyzeExpense
response_receipt = textract.analyze_expense(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

In [None]:
%store response_receipt

In [None]:
pretty_printed_string = get_string(textract_json=response_receipt, output_type=[Textract_Expense_Pretty_Print.SUMMARY, Textract_Expense_Pretty_Print.LINEITEMGROUPS], table_format=Pretty_Print_Table_Format.fancy_grid)
print(pretty_printed_string)

Now we will use the **Textract Query** feature to extract some additional information, such as the date and the merchant address.

In [None]:
response_query_receipt = textract.analyze_document(
            Document={'S3Object': {'Bucket': s3BucketName, 'Name': documentName}},
            FeatureTypes=["QUERIES"],
            QueriesConfig={
                    'Queries': [
                        {
                            'Text': 'what is the date',
                            'Alias': 'Date'
                        },
                        {
                            'Text': 'what is the merchant address',
                            'Alias': 'merchant_address'
                        }
                    ]
                }
            )

In [None]:
import trp.trp2 as t2
d = t2.TDocumentSchema().load(response_query_receipt)
page = d.pages[0]
query_answers = d.get_query_answers(page=page)
for item in query_answers:
    print(item[1],": ", item[2])
    if item[1]=='merchant_address':
        merchant_address = item[2]

In [None]:
# Keep only the last two whitespace-separated tokens of the address as the search text
address = " ".join(merchant_address.split()[-2:])
address

### Extracting Currency Using Amazon Location Service

In [None]:
import sagemaker as sm
role = sm.get_execution_role()
role

**Before you can use Amazon Location Service, make sure that you have added geo search access to the SageMaker execution role.**
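
As a sketch of what that access might look like, the snippet below builds an IAM policy document that allows text search on a single place index. The region, account ID, and index name are placeholders (assumptions), and the variable names are chosen so they do not clash with the rest of the notebook. The policy still has to be attached to the execution role, for example through the IAM console.

```python
import json

# Placeholders (assumptions): replace with your own region, account ID, and place index name
policy_region = "us-east-1"
policy_account_id = "111122223333"
place_index_name = "ExamplePlaceIndex"

geo_search_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            # Action used by search_place_index_for_text
            "Action": ["geo:SearchPlaceIndexForText"],
            "Resource": (
                f"arn:aws:geo:{policy_region}:{policy_account_id}"
                f":place-index/{place_index_name}"
            ),
        }
    ],
}

print(json.dumps(geo_search_policy, indent=2))
```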

In [None]:
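# Geocode the address with Amazon Location Service.
# 'ExamplePlaceIndex' must be the name of a place index that exists in your account.
client = boto3.client('location')
country_api = client.search_place_index_for_text(IndexName='ExamplePlaceIndex', MaxResults=1, Text=address)

# Only use the top result when its relevance score is high enough
if country_api['Results'] and country_api['Results'][0]['Relevance'] > 0.8:
    country = country_api['Results'][0]['Place']['Country']
    print("country from Amazon location API:", country)
    # The country is returned as a three-letter code
    if country == 'USA':
        currency = 'USD'
        print("Currency: ", currency)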
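
The cell above only handles receipts from the USA. One way to extend it is a lookup table from three-letter country codes to currency codes, sketched below. `COUNTRY_TO_CURRENCY` is a small illustrative subset, `currency_from_search_result` is a hypothetical helper, and the sample responses are hand-written to mimic the shape used above.

```python
# Illustrative subset only (assumption); extend with the countries you expect to see
COUNTRY_TO_CURRENCY = {
    "USA": "USD",
    "CAN": "CAD",
    "GBR": "GBP",
    "IND": "INR",
    "JPN": "JPY",
}


def currency_from_search_result(search_response, min_relevance=0.8):
    """Return a currency code for the top search result, or None if it cannot be decided."""
    results = search_response.get("Results", [])
    if not results:
        return None
    top = results[0]
    # Treat a missing relevance score as "not relevant enough"
    if top.get("Relevance", 0) <= min_relevance:
        return None
    country = top.get("Place", {}).get("Country")
    return COUNTRY_TO_CURRENCY.get(country)


# Hand-written samples (assumption, not real service output)
confident_match = {"Results": [{"Relevance": 0.95, "Place": {"Country": "CAN"}}]}
weak_match = {"Results": [{"Relevance": 0.4, "Place": {"Country": "USA"}}]}

print(currency_from_search_result(confident_match))
print(currency_from_search_result(weak_match))
```

Returning `None` for weak or unknown matches leaves the decision to the caller, who can fall back to a default currency or flag the receipt for review.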

### Let's try out another receipt

In [None]:
from PIL import Image
import s3fs

fs = s3fs.S3FileSystem()

s3BucketName = "my-projects-abhi-2022"
documentName = "Textract/invoices_processing_workshop/receipt_02.png"

# open it directly
with fs.open("s3://" + s3BucketName + "/" + documentName) as f:
    img=Image.open(f)
    basewidth = 1000
    wpercent = (basewidth/float(img.size[0]))
    hsize = int((float(img.size[1])*float(wpercent)))
    img = img.resize((basewidth,hsize), Image.BICUBIC)
    display(img)

In [None]:
#-- Call Amazon Textract AnalyzeExpense
response_receipt_2 = textract.analyze_expense(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

In [None]:
%store response_receipt_2

In [None]:
pretty_printed_string = get_string(textract_json=response_receipt_2, output_type=[Textract_Expense_Pretty_Print.SUMMARY, Textract_Expense_Pretty_Print.LINEITEMGROUPS], table_format=Pretty_Print_Table_Format.fancy_grid)
print(pretty_printed_string)

In [None]:
response_query_receipt_2 = textract.analyze_document(
            Document={'S3Object': {'Bucket': s3BucketName, 'Name': documentName}},
            FeatureTypes=["QUERIES"],
            QueriesConfig={
                    'Queries': [
                        {
                            'Text': 'what is the date',
                            'Alias': 'Date'
                        },
                        {
                            'Text': 'what is the merchant address',
                            'Alias': 'merchant_address'
                        }
                    ]
                }
            )

In [None]:
import trp.trp2 as t2
d = t2.TDocumentSchema().load(response_query_receipt_2)
page = d.pages[0]
query_answers = d.get_query_answers(page=page)
for item in query_answers:
    print(item[1],": ", item[2])
    if item[1]=='merchant_address':
        merchant_address = item[2]