## Import all the necessary modules and global variables

In [46]:
from docai_module.config import *

# [Document AI](https://console.cloud.google.com/ai/document-ai)

Document AI uses advanced machine learning techniques to extract data and structure from files.  
In the past, capturing this unstructured data has been an expensive, time-consuming, and error-prone process requiring manual data entry. Today, advances in AI and machine learning automate much of this process, enabling businesses to derive insights from data that was previously untapped.

The goal of this lab is to explore how to use this powerful tool.

## Lab details

 - Process a PDF Document using Document AI Form Parser (online and batch methods)
 - Explore the results for OCR, Table Recognition and Key/Value pairs

## Create a Processor
Before you start processing documents, you need to create a processor of the type that matches the processing you want to perform.  
For this example, we will create a Form Parser processor which can parse Tables, do OCR, extract key / value pairs, etc.

 - Cloud Console UI > Navigate to Document AI

<img src="./images/1_1_menu.jpg"
     alt="Menu"
     style="width:15%"
     />

 - Click on Create Processor
 
<img src="./images/1_2_processor.png"
     alt="Processor"
     style="width:30%"
     />

 - There are several options for processors. Select the Form Parser.
 
<img src="./images/1_3_form_parser.png"
     alt="Form Parser"
     style="width:40%"
     />

 - Give the name "document_processing", choose the US location and click CREATE

<img src="./images/1_4_name_processor.png"
     alt="Name Processor"
     style="width:30%"
     />

 - On the left menu, click on Processors and click on the processor you just created

<img src="./images/1_5_find_processor.png"
     alt="Created Processor"
     style="width:30%"
     />

 - Take note of the ID of the processor and fill it in the next cell of the notebook.
  - Your ID will be different from the one shown in this picture.

<img src="./images/1_6_processor_id.png"
     alt="Processor ID"
     style="width:30%"
     />

In [12]:
# Fill this variable with the ID of the processor created in the previous steps
PROCESSOR_ID = 'ff4bad3352769404'

### Quick Test using the UI

After creating your processor you can test it in the Processors > Details page.

 - Save a local copy of the sample form document [gs://cloud-samples-data/documentai/loan_form.pdf](https://storage.googleapis.com/cloud-samples-data/documentai/loan_form.pdf)  
 This document is stored in a publicly accessible Cloud Storage bucket.

 - In the "Test your processor" section of the processor detail page, click on UPLOAD DOCUMENT and select the document you just downloaded.  
 This takes you to the document analysis page, where you can view the returned document annotations.

<img src="./images/1_7_test_processor.png"
     alt="Test Processor"
     style="width:30%"
     />
     
 - The result of this processing looks like this
 
<img src="./images/1_8_result_form.png"
     alt="Test Processor"
     style="width:50%"
     />
     
Document AI Form Parser is capable of detecting Tables, Checkboxes, OCR, Key/Value pairs, etc. It is a very powerful tool to process documents.  
Let's explore how we can use these features programmatically using client libraries.

## Calling Document AI

Currently Document AI can be used for PDF, TIFF and GIF files.  
There are two ways you can call the API: online/synchronous and offline/asynchronous/batch.

 - Online processing supports smaller files and returns the results immediately to the caller.
 - Offline processing returns an operation ID and finishes the processing in the background.

Which method you use depends on how you design your system.  
Next, we will discuss how to use both methods and how to interpret the result of the API call.
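
The file types listed above can be checked before calling the API. The following is a minimal sketch that guesses the MIME type from the file extension with the standard library and rejects anything outside the three types named in this lab. The names `SUPPORTED_MIME_TYPES` and `guess_supported_mime_type` are hypothetical helpers, not part of `docai_module` or the Document AI client.

```python
import mimetypes

# File types named in this lab as accepted by Document AI.
SUPPORTED_MIME_TYPES = {"application/pdf", "image/tiff", "image/gif"}


def guess_supported_mime_type(file_path: str) -> str:
    """Return the MIME type guessed from the extension, or raise if unsupported."""
    mime_type, _ = mimetypes.guess_type(file_path)
    if mime_type not in SUPPORTED_MIME_TYPES:
        raise ValueError(f"Unsupported file type for {file_path!r}: {mime_type}")
    return mime_type


print(guess_supported_mime_type("./files/loan_form.pdf"))
```

The guess is based on the file name only, so a mislabeled file would still pass this check.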

## Small file online processing

Synchronous ("online") requests target a document with a small number of pages and size.  
Synchronous requests immediately return a response inline.  
The following function calls the Document AI API directly to process a PDF and return a JSON with the results.  

<img src="./images/1_arch_online.png"
     alt="Online Processing"
     style="width:30%"
     />

Let's process this document with Document AI Form Parser.  
According to the documentation, Document AI online processing currently supports either a URI in the format gs://... or a local file.  

https://cloud.google.com/document-ai/docs/reference/rpc/google.cloud.documentai.v1beta3#document

In the following example, let's define a function to process a local file.

In [13]:
def process_document_online(file_path: str, mime_type: str):
    # The full resource name of the processor
    name = f"projects/{PROJECT_ID}/locations/{DOCAI_LOCATION}/processors/{PROCESSOR_ID}"

    # Read the file into memory
    with open(file_path, "rb") as document_file:
        document_content = document_file.read()

    # Configure the process request
    document = {"content": document_content, "mime_type": mime_type}
    request = {"name": name, "document": document}

    # Send the document to the processor and return the response
    return DOCAI_CLIENT.process_document(request=request)

In [14]:
# Define the file path
file_path = './files/loan_form.pdf'
response = process_document_online(file_path, MIME_TYPE)

In [15]:
# Document Text
response.document.text

'Loan Agreement Form\nAgreement Number:\n0123456789\nAgreement date:\n01/01/2020\nThis loan agreement is commenced between the parties:\nMortgage company contact details:\nName:\nMortgage company A\nAddress:\n100 Franklin Street, Mountain View, CA, 94035\nPhone number: 1-800-843-8623\n(hereinafter referred to as the lender)\nIndividual details:\nName:\nArjun Patel\nMarital status:\nSingle\nMarried ☐\nOther\nAddress:\n500 Castro Street, Mountain View, CA 94035\nPhone number: 650-987-0934\n(hereinafter referred to as the borrower)\n[Fill in all details as per instructions]\n6.0\n%.\nThe lender is ready to sanction $ 2000 as the loan amount at\n[Total loan amount along with the agreed percentage rate].\nThis loan agreement is valid from 01/01/2020 and is ending on 12/31/2020.\nTerms & agreements:\n38.67\nper month for\n5\nyears.\nThe borrower will pay an installment of $\n[Amount & tenure of loan]\nAny late installment will be accepted with $\n40\nas a fine.\n'

In [16]:
# Document Name / Values
response.document.pages[0].form_fields[0]

field_name {
  text_anchor {
    text_segments {
      start_index: 20
      end_index: 38
    }
  }
  confidence: 0.9999979138374329
  bounding_poly {
    normalized_vertices {
      x: 0.12571103870868683
      y: 0.11560439318418503
    }
    normalized_vertices {
      x: 0.2946530282497406
      y: 0.11560439318418503
    }
    normalized_vertices {
      x: 0.2946530282497406
      y: 0.13230769336223602
    }
    normalized_vertices {
      x: 0.12571103870868683
      y: 0.13230769336223602
    }
  }
  orientation: PAGE_UP
}
field_value {
  text_anchor {
    text_segments {
      start_index: 38
      end_index: 49
    }
  }
  confidence: 0.9999979138374329
  bounding_poly {
    normalized_vertices {
      x: 0.3122866749763489
      y: 0.117802195250988
    }
    normalized_vertices {
      x: 0.4084186553955078
      y: 0.117802195250988
    }
    normalized_vertices {
      x: 0.4084186553955078
      y: 0.13010989129543304
    }
    normalized_vertices {
      x: 0.31228667
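
The `text_anchor` entries in this output do not contain text. They hold character offsets into `response.document.text`. As a small worked example, the sketch below copies the first characters of the document text shown earlier and slices them with the offsets of the first form field (20 to 38 for the name, 38 to 49 for the value). The variable names are illustrative only.

```python
# First characters of response.document.text, copied from the output above.
text = "Loan Agreement Form\nAgreement Number:\n0123456789\nAgreement date:\n"

# Offsets copied from the text_segments of form_fields[0].
field_name = text[20:38]
field_value = text[38:49]

print(repr(field_name))   # 'Agreement Number:\n'
print(repr(field_value))  # '0123456789\n'
```

The end index behaves like the end of a Python slice: it is exclusive, and the trailing newline is part of each segment.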

In [17]:
def print_doc_result(document: documentai.types.document.Document):
    for page in document.pages:
        # Key/value pairs detected on this page
        for form_field in page.form_fields:
            field_name = _get_text(form_field.field_name, document)
            field_value = _get_text(form_field.field_value, document)
            print("Extracted key value pair:")
            print(f"\t{field_name} {field_value}")
        # Paragraphs detected on this page
        for paragraph in page.paragraphs:
            paragraph_text = _get_text(paragraph.layout, document)
            print(f"Paragraph text:\n{paragraph_text}")

# Extract text snippets from the document text field
def _get_text(
    doc_element: documentai.types.document.Document.Page.Layout,
    document: documentai.types.document.Document,
) -> str:
    """
    Document AI identifies form fields by their offsets
    in document text. This function converts offsets
    to text snippets.
    """
    response = ""
    # If a text segment spans several lines, it will
    # be stored in different text segments.
    for segment in doc_element.text_anchor.text_segments:
        # start_index is 0 when the segment starts at the beginning of the text
        start_index = int(segment.start_index)
        end_index = int(segment.end_index)
        response += document.text[start_index:end_index]
    return response

In [19]:
# Print Name/Values and Paragraphs
# There are other elements like Tables, Lines, Blocks, etc.
print_doc_result(response.document)

Extracted key value pair:
	Agreement Number:
 0123456789

Extracted key value pair:
	Address:
 100 Franklin Street, Mountain View, CA, 94035

Extracted key value pair:
	Agreement date:
 01/01/2020

Extracted key value pair:
	Phone number:  1-800-843-8623

Extracted key value pair:
	Address:
 500 Castro Street, Mountain View, CA 94035

Extracted key value pair:
	Other
 
Extracted key value pair:
	Married  ☐

Extracted key value pair:
	Name:
 Arjun Patel

Extracted key value pair:
	Phone number:  650-987-0934

Extracted key value pair:
	per month for
 5
years.

Extracted key value pair:
	Single
 
Extracted key value pair:
	The borrower will pay an installment of $
 38.67

Extracted key value pair:
	The lender is ready to sanction $  2000 
Extracted key value pair:
	as the loan amount at
 6.0
%
Extracted key value pair:
	Name:
 Mortgage company A

Extracted key value pair:
	Any late installment will be accepted with $
 40
as a 
Extracted key value pair:
	This loan agreement is valid from 

## Large file offline processing

Asynchronous ("offline") requests target longer documents.  
These requests start a long-running operation. When the operation finishes, it stores the output as a JSON file in a specified Cloud Storage bucket.

Document AI asynchronous processing accepts PDF, TIFF, GIF files up to 2000 pages. Attempting to process larger files returns an error.

<img src="./images/1_arch_offline.png"
     alt="Offline Processing"
     style="width:30%"
     />
     
To process documents in batch using the offline method, we need a Cloud Storage bucket from which the API reads the PDF and to which it writes the result.  
Let's use the TEST_BUCKET to upload a document and test the async call.

In [20]:
print(f'TEST_BUCKET: gs://{TEST_BUCKET}/')

TEST_BUCKET: gs://cool-ml-demos-test/


Let's copy the PDF to execute in batch (offline).

In [26]:
# Function to upload the PDF to a bucket
def upload_blob(bucket_name, source_file_name, destination_blob_name):
    """Uploads a file to the bucket."""
    bucket = STORAGE_CLIENT.bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)

    blob.upload_from_filename(source_file_name)

    print(
        "File {} uploaded to {}.".format(
            source_file_name, destination_blob_name
        )
    )

In [27]:
# Call the function to upload the file to the bucket
upload_blob(TEST_BUCKET, './files/loan_form.pdf', 'loan_form.pdf')

File ./files/loan_form.pdf uploaded to loan_form.pdf.


Go back to the Google Cloud web console, select your bucket (click on its name) and check if the PDF was uploaded, like the picture below.

<img src="./images/1_10_storage.png"
     alt="Cloud Storage"
     style="width:70%"
     />

Ok, now let's define a method to process the document offline.

In [32]:
# Process a document async
def async_process_document(
    gcs_input_uri: str,
    gcs_output_uri: str,
    gcs_output_uri_prefix: str
):
    """Start a batch (asynchronous) request and return the long-running operation."""
    # 'mime_type' can be 'application/pdf', 'image/tiff',
    # and 'image/gif', or 'application/json'
    service = documentai.types.document_processor_service
    input_config = service.BatchProcessRequest.BatchInputConfig(
        gcs_source=gcs_input_uri, mime_type=MIME_TYPE
    )

    # Where to write results
    destination_uri = f'{gcs_output_uri}/{gcs_output_uri_prefix}/'
    output_config = service.BatchProcessRequest.BatchOutputConfig(
        gcs_destination=destination_uri
    )

    # Call Processor to process document
    name = f"projects/{PROJECT_ID}/locations/{DOCAI_LOCATION}/processors/{PROCESSOR_ID}"
    request = service.BatchProcessRequest(
        name=name,
        input_configs=[input_config],
        output_config=output_config,
    )
    operation = DOCAI_CLIENT.batch_process_documents(request)

    return operation

In [37]:
# Define the input URI, destination and prefix (folder)
gcs_input_uri = f'gs://{TEST_BUCKET}/loan_form.pdf'
gcs_output_uri = f'gs://{TEST_BUCKET}'
gcs_output_uri_prefix = 'results'
destination_uri = f'{gcs_output_uri}/{gcs_output_uri_prefix}/'

In [34]:
# Call the operation
operation = async_process_document(gcs_input_uri, gcs_output_uri, gcs_output_uri_prefix)

In [45]:
# Get the details of this operation
# Important to note the "name", which identifies the op
# It is possible to retrieve the status of an operation with its name
operation.operation

name: "projects/411150075841/locations/us/operations/2508786828607454634"
metadata {
  type_url: "type.googleapis.com/google.cloud.documentai.v1beta3.BatchProcessMetadata"
  value: "\010\003\022$Processed 1 document(s) successfully\032\014\010\352\212\314\200\006\020\310\363\376\203\003\"\014\010\242\213\314\200\006\020\250\237\245\362\001*^\n%gs://cool-ml-demos-test/loan_form.pdf\0325gs://cool-ml-demos-test/results/2508786828607454634/0"
}
done: true
response {
  type_url: "type.googleapis.com/google.cloud.documentai.v1beta3.BatchProcessResponse"
}
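
The `done` flag in this output can also be polled from code instead of blocking on the result. The sketch below assumes the object returned by `batch_process_documents` exposes a `done()` method, as long-running operations in `google-api-core` do. `wait_until_done` and `FakeOperation` are hypothetical names, and the fake object only exists so the loop can be run without calling the API.

```python
import time


def wait_until_done(operation, poll_seconds: float = 5.0, timeout_seconds: float = 300.0) -> bool:
    """Poll operation.done() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if operation.done():
            return True
        time.sleep(poll_seconds)
    # One last check after the deadline has passed.
    return operation.done()


class FakeOperation:
    """Stand-in operation that reports completion on the third poll."""

    def __init__(self):
        self.polls = 0

    def done(self) -> bool:
        self.polls += 1
        return self.polls >= 3


fake_operation = FakeOperation()
print(wait_until_done(fake_operation, poll_seconds=0.01, timeout_seconds=1.0))
```

Polling lets the notebook report progress or do other work between checks, at the cost of a few extra status requests.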

In [36]:
# Wait for the processing to conclude (~1 minute)
operation.result(timeout=300)



In [39]:
# Results are written to GCS. Use a regex to find output files
match = re.match(r"gs://([^/]+)/(.+)", destination_uri)
output_bucket = match.group(1)
prefix = match.group(2)

bucket = STORAGE_CLIENT.get_bucket(output_bucket)
blob_list = list(bucket.list_blobs(prefix=prefix))

print(blob_list)

[<Blob: cool-ml-demos-test, results/2508786828607454634/0/loan_form-0.json, 1611859362210631>]
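
The regex in the cell above splits a `gs://` URI into a bucket name and an object prefix, and it assumes the match succeeds. As a sketch, the same split can be wrapped in a helper that fails with a clear message when the URI is malformed. `split_gcs_uri` is a hypothetical name, and its pattern is slightly relaxed so that a bucket-only URI is also accepted.

```python
import re


def split_gcs_uri(uri: str) -> tuple:
    """Split gs://bucket/path into (bucket, path); path may be empty."""
    match = re.match(r"gs://([^/]+)/?(.*)", uri)
    if match is None:
        raise ValueError(f"Not a Cloud Storage URI: {uri!r}")
    return match.group(1), match.group(2)


print(split_gcs_uri("gs://cool-ml-demos-test/results/"))
# ('cool-ml-demos-test', 'results/')
```

The second element is what gets passed as `prefix` when listing blobs, which is why the output file `results/2508786828607454634/0/loan_form-0.json` is found.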


In [40]:
def print_offline_results(blob_list: list):
    print("Output files:")
    for i, blob in enumerate(blob_list):
        # Only JSON files are parsed; skip anything else and keep going
        if not blob.name.endswith('.json'):
            print(f"skipping non-supported file type {blob.name}")
            continue
        # Download the contents of this blob as a bytes object
        blob_as_bytes = blob.download_as_bytes()
        document = documentai.types.Document.from_json(blob_as_bytes)
        print(f"Fetched file {i + 1}")

        # Read the text recognition output from the processor
        print_doc_result(document)

In [41]:
print_offline_results(blob_list)

Output files:
Fetched file 1
Extracted key value pair:
	Agreement Number:
 0123456789

Extracted key value pair:
	Address:
 100 Franklin Street, Mountain View, CA, 94035

Extracted key value pair:
	Agreement date:
 01/01/2020

Extracted key value pair:
	Phone number:  1-800-843-8623

Extracted key value pair:
	Address:
 500 Castro Street, Mountain View, CA 94035

Extracted key value pair:
	Other
 
Extracted key value pair:
	Married  ☐

Extracted key value pair:
	Name:
 Arjun Patel

Extracted key value pair:
	Phone number:  650-987-0934

Extracted key value pair:
	per month for
 5
years.

Extracted key value pair:
	Single
 
Extracted key value pair:
	The borrower will pay an installment of $
 38.67

Extracted key value pair:
	The lender is ready to sanction $  2000 
Extracted key value pair:
	as the loan amount at
 6.0
%
Extracted key value pair:
	Name:
 Mortgage company A

Extracted key value pair:
	Any late installment will be accepted with $
 40
as a 
Extracted key value pair:
	This 

## IMPORTANT: Execute the next two cells once

In [None]:
%%capture output --no-stderr
print(f'PROCESSOR_ID = \'{PROCESSOR_ID}\'')

In [None]:
with open('./docai_module/config.py', 'a') as f:
    f.write(output.stdout)
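
These cells should run only once because the file is opened in append mode: each extra run writes another `PROCESSOR_ID` line to `config.py`. A minimal sketch of an append that checks for the line first is shown below. `append_line_once` is a hypothetical helper, and the demonstration writes to a temporary file rather than the real `config.py`.

```python
import os
import tempfile


def append_line_once(path: str, line: str) -> bool:
    """Append line to path unless an identical line is already present."""
    content = ""
    if os.path.exists(path):
        with open(path) as f:
            content = f.read()
    if line in content.splitlines():
        return False
    with open(path, "a") as f:
        # Keep the new line separate if the file does not end with a newline.
        if content and not content.endswith("\n"):
            f.write("\n")
        f.write(line + "\n")
    return True


# Demonstration on a temporary file; the second call is a no-op.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "config.py")
    first = append_line_once(demo_path, "PROCESSOR_ID = 'ff4bad3352769404'")
    second = append_line_once(demo_path, "PROCESSOR_ID = 'ff4bad3352769404'")
    print(first, second)  # True False
```

This only guards against exact duplicates. A changed processor ID would still be appended as a new line, and the last assignment in the file would win on import.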

# Challenge

The following bucket contains a PDF file with a table in it.
Can you tell what the header of this table is? (Use the Document AI Form Parser.)

> gs://cloud-samples-data/documentai/invoice.pdf

In [None]:
# Start developing here
