# **Information Extraction from forms - Automatic Form Recognition**

#### Azure Form Recognizer

Azure Form Recognizer is a cognitive service that uses machine learning technology to identify and extract key-value pairs and table data from form documents. It then outputs structured data that includes the relationships in the original file.


## **Scenario Overview**

Account Opening Form Dataset: Raw unstructured data is fed into the pipeline in the form of electronically generated PDFs. This data consists of forms submitted at the time of opening a corporate account with the bank.

### Notebook Organization

Fetch the account opening PDF files from a container in an Azure Storage account.

Convert the PDF files to JSON by querying the trained Azure Form Recognizer model through its Python client library, which wraps the REST API.

Preprocess the JSON files to extract only relevant information.

Push the JSON files to a container in an Azure Storage account.
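
The preprocessing step is not spelled out in this notebook. As a rough sketch only, assuming the JSON files use the pandas `orient="split"` layout (keys `columns`, `index` and `data`) that the notebook writes further down, the relevant fields could be selected with a confidence filter. The function name `keep_confident_fields`, the `0.5` threshold and the sample field names are hypothetical and not part of the original pipeline.

```python
import json

def keep_confident_fields(split_json, min_confidence=0.5):
    """Return {name: value} for rows whose confidence is at least min_confidence."""
    doc = json.loads(split_json)
    cols = doc["columns"]
    name_i, value_i, conf_i = (cols.index(c) for c in ("Name", "Value", "Confidence"))
    kept = {}
    for row in doc["data"]:
        conf = row[conf_i]
        # rows without a numeric confidence are dropped
        if isinstance(conf, (int, float)) and conf >= min_confidence:
            kept[row[name_i]] = row[value_i]
    return kept

# hypothetical sample in the orient="split" layout
sample = json.dumps({
    "columns": ["Name", "Value", "Confidence"],
    "index": [0, 1],
    "data": [["CompanyName", "Contoso Ltd", 0.98], ["Signatory", None, 0.21]],
})
print(keep_confident_fields(sample))
```

Only the standard `json` module is used here, so the same filter could run in a front end or a later pipeline stage without pandas.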

### In this notebook we run a custom Form Recognizer model that has already been trained. First, we import our packages.


## Importing required libraries

In [1]:
!pip install azure-ai-formrecognizer



In [2]:
# import packages 
from azure.ai.formrecognizer import FormRecognizerClient
from azure.core.credentials import AzureKeyCredential
import os
import pandas as pd
import json
from azure.storage.blob import ContainerClient

#### Next, we check the pip version, list the installed packages, and get the path to the current working directory.

In [3]:
pip --version

pip 20.1.1 from /anaconda/envs/azureml_py36/lib/python3.6/site-packages/pip (python 3.6)
Note: you may need to restart the kernel to use updated packages.


In [4]:
import pkg_resources
for d in pkg_resources.working_set:
    print(d)

zope.interface 5.4.0
zope.event 4.5.0
zipp 3.6.0
zict 2.0.0
yellowbrick 1.3.post1
yarl 1.7.2
yapf 0.31.0
xxhash 2.0.2
xmltodict 0.12.0
xgboost 0.90
wrapt 1.12.1
wordcloud 1.8.1
widgetsnbextension 3.5.2
wheel 0.35.1
Werkzeug 2.0.2
websockets 9.1
websocket-client 1.2.1
webencodings 0.5.1
wcwidth 0.2.5
wasabi 0.8.2
waitress 2.0.0
vsts 0.1.25
vsts-cd-manager 1.0.2
visions 0.7.4
uuid 1.30
urllib3 1.25.11
umap-learn 0.5.2
ujson 4.2.0
typing-extensions 3.10.0.2
typed-ast 1.5.0
transformers 4.5.1
traitlets 4.3.3
tqdm 4.62.3
tornado 6.1
torchvision 0.9.1
torch 1.10.0
torch-tb-profiler 0.3.1
toolz 0.11.1
toml 0.10.2
tokenizers 0.10.3
tifffile 2020.9.3
threadpoolctl 2.1.0
thinc 7.0.8
textwrap3 0.9.2
textblob 0.17.1
testpath 0.5.0
terminado 0.12.1
termcolor 1.1.0
tensorflow 2.1.0
tensorflow-gpu 2.1.0
tensorflow-estimator 2.1.0
tensorboardX 2.4
tensorboard 2.1.1
tensorboard-plugin-wit 1.8.0
tensorboard-data-server 0.6.1
tenacity 8.0.1
tblib 1.7.0
tangled-up-in-unicode 0.1.0
tabulate 0.8.9
sympy 1.9

autopep8 1.6.0
autokeras 1.0.16.post1
attrs 21.2.0
asynctest 0.13.0
async-timeout 4.0.1
astunparse 1.6.3
astroid 2.8.5
astor 0.8.1
argon2-cffi 21.1.0
argcomplete 1.12.3
arch 4.14
applicationinsights 0.11.10
anyio 3.3.4
antlr4-python3-runtime 4.7.2
ansiwrap 0.8.4
alembic 1.4.1
aiosignal 1.2.0
aioredis 1.3.1
aiohttp 3.8.1
aiohttp-cors 0.7.0
adal 1.2.7
absl-py 0.15.0
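
As an alternative to `pkg_resources`, the same listing can be sketched with the standard-library `importlib.metadata` module. This assumes Python 3.8 or later; the kernel shown above runs Python 3.6, where only the `importlib_metadata` backport would be available. The helper name `installed_packages` is made up for this example.

```python
from importlib import metadata  # standard library from Python 3.8 onward

def installed_packages():
    """Return sorted 'name version' strings for every installed distribution."""
    found = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip broken installs that have no name in their metadata
            found.add(f"{name} {dist.version}")
    return sorted(found, key=str.lower)

# print the first few entries only, to keep the cell output short
for line in installed_packages()[:5]:
    print(line)
```

Printing a slice keeps the cell output readable when the environment holds hundreds of packages.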


In [5]:
import os
os.getcwd()

'/mnt/batch/tasks/shared/LS_root/mounts/clusters/asma1/code/Users/asma'

#### This code creates path variables for the local directories that hold the input forms and the output JSON files. If the folders do not exist, they are created first.

## Creating local directories

In [6]:
# Create local directories if they don't exist

# *input_forms* contains all the PDF files
input_path = os.path.join(os.getcwd(), "input_forms")
# *output_json* will contain all the converted JSON files
output_path = os.path.join(os.getcwd(), "output_json")

# exist_ok=True makes the call a no-op when the folder is already there
os.makedirs(input_path, exist_ok=True)
os.makedirs(output_path, exist_ok=True)

print(input_path)

/mnt/batch/tasks/shared/LS_root/mounts/clusters/asma1/code/Users/asma/input_forms


#### Here we import global variables for our endpoints, keys, credentials and IDs. These are kept in a separate file because they must stay secret. We also create the Form Recognizer client.

## Establishing an Azure Blob Storage Connection

In [7]:
import GlobalVariables
CONTAINER_NAME = GlobalVariables.INPUT_CONTAINER
CONNECTION_STRING = GlobalVariables.CONNECTION_STRING_third
endpoint = GlobalVariables.endpoint
credential = AzureKeyCredential(GlobalVariables.credential)
model_id = GlobalVariables.model_id
form_recognizer_client = FormRecognizerClient(endpoint, credential)
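
Another common way to keep secrets out of the notebook is to read them from environment variables. The following is only a sketch of that idea, not how `GlobalVariables` is implemented: the variable names in `REQUIRED_SETTINGS` and the helper `load_settings` are assumptions made for illustration, and the demo values are placeholders.

```python
import os

# hypothetical setting names, chosen for this example only
REQUIRED_SETTINGS = (
    "FORM_RECOGNIZER_ENDPOINT",
    "FORM_RECOGNIZER_KEY",
    "FORM_RECOGNIZER_MODEL_ID",
)

def load_settings(environ=os.environ):
    """Read the required settings and fail early, naming any that are missing."""
    missing = [name for name in REQUIRED_SETTINGS if not environ.get(name)]
    if missing:
        raise KeyError("Missing settings: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_SETTINGS}

# a plain dict stands in for os.environ so the example runs anywhere
demo_environ = {
    "FORM_RECOGNIZER_ENDPOINT": "https://example.invalid/",
    "FORM_RECOGNIZER_KEY": "not-a-real-key",
    "FORM_RECOGNIZER_MODEL_ID": "not-a-real-model-id",
}
settings = load_settings(demo_environ)
print(sorted(settings))
```

Failing early with the list of missing names tends to be easier to debug than an authentication error raised later by the service.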

#### This block of code uses the global variables to connect to Blob Storage and downloads the input forms into the local input folder. The local output files are transferred to Blob Storage after all forms have been processed.

In [8]:
# create a container client and list the blobs in the input container

container_client = ContainerClient.from_connection_string(conn_str=CONNECTION_STRING, container_name=CONTAINER_NAME)
blobs_list = container_client.list_blobs()

# initializing the lists that will be used below
blob_client_list = []
blob_file_list = []

# getting the blob clients and appending them to a list
for c in blobs_list:
    blob_client = container_client.get_blob_client(c)
    blob_file_list.append(c.name)
    blob_client_list.append(blob_client)

# download every blob into the local input folder under its blob name
for filename, blob_client in zip(blob_file_list, blob_client_list):
    fname = os.path.join(input_path, filename)
    with open(fname, "wb") as blob_file:
        download_stream = blob_client.download_blob()
        download_stream.readinto(blob_file)
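
Blob names may contain `/` to form virtual folders, in which case opening `os.path.join(input_path, filename)` fails unless the matching local subfolder exists. A minimal sketch of one way to handle that is shown below; the helper `local_path_for_blob` and the sample blob name are hypothetical and do not appear in the original notebook.

```python
import os
import tempfile

def local_path_for_blob(base_dir, blob_name):
    """Map a blob name to a path under base_dir, rejecting names that would escape it."""
    parts = [p for p in blob_name.split("/") if p not in ("", ".")]
    if not parts or ".." in parts:
        raise ValueError(f"Unsafe blob name: {blob_name!r}")
    return os.path.join(base_dir, *parts)

# demo in a throwaway folder: create the parent folders before writing the file
with tempfile.TemporaryDirectory() as base:
    target = local_path_for_blob(base, "2021/forms/account_01.pdf")
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "wb") as fh:
        fh.write(b"%PDF-")  # placeholder bytes, not a real PDF
    print(os.path.exists(target))
```

In the download loop, `fname` could then be built with this helper, followed by the same `os.makedirs(..., exist_ok=True)` call before the file is opened.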



#### This is where each form is sent through Form Recognizer, which returns the recognized values. That output is then converted to a DataFrame and finally written to a JSON file. The JSON file makes it easier for front-end developers to display the results in a web browser.

## Running Form Recognizer Service on the Forms

In [9]:
# Iterate through the input folder, run each file through Form Recognizer and collect the results. Ignore files are skipped.

columns = ['Name','Value','Confidence','Bottom Left X','Bottom Left Y','Bottom Right X','Bottom Right Y','Top Right X','Top Right Y','Top Left X','Top Left Y']

for filename in os.listdir(input_path):
    # remove stray JSON files and move on to the next file
    if filename.endswith(".json"):
        os.remove(os.path.join(input_path, filename))
        continue
    if filename in ('.amlignore', '.amlignore.amltmp'):
        continue
    print(filename)
    with open(os.path.join(input_path, filename), "rb") as fd:
        form = fd.read()
    poller = form_recognizer_client.begin_recognize_custom_forms(model_id=model_id, form=form, include_field_elements=True, content_type="application/pdf")
    result = poller.result()

    values = []
    for recognized_form in result:
        for name, field in recognized_form.fields.items():
            try:
                # Bounding box data is stored as (x, y) for all 4 corners. The corner coordinates are available to developers
                # for drawing boxes around the recognized values. Coordinates are in inches because the file is a PDF.
                point_coord = field.value_data.bounding_box
                coords = [point_coord[i][j] for i in range(4) for j in range(2)]
            except Exception:
                # no usable bounding box for this field
                coords = ["NA"] * 8
            values.append([name, field.value, field.confidence] + coords)

    # convert the collected rows to a pandas DataFrame
    df = pd.DataFrame(values, columns=columns)

    # convert the DataFrame to JSON and save it in the local output folder
    json_name = os.path.splitext(filename)[0] + '.json'
    df.to_json(os.path.join(output_path, json_name), orient="split")
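
The bounding-box handling in this cell can also be isolated in a small helper, which makes the "NA" fallback easy to check without calling the service. This is a sketch under assumptions: `flatten_corners` is a hypothetical name, and the `Point` namedtuple is only a stand-in for the indexable point objects the cell reads with `[0]` and `[1]`.

```python
from collections import namedtuple

# stand-in for the (x, y) points in a bounding box, for illustration only
Point = namedtuple("Point", "x y")

def flatten_corners(points, missing="NA"):
    """Return [x0, y0, x1, y1, x2, y2, x3, y3], or eight placeholders if points is unusable."""
    if not points or len(points) != 4:
        return [missing] * 8
    flat = []
    for p in points:
        flat.extend([p[0], p[1]])
    return flat

box = [Point(1.0, 2.0), Point(3.0, 2.0), Point(3.0, 2.5), Point(1.0, 2.5)]
print(flatten_corners(box))   # eight coordinates in corner order
print(flatten_corners(None))  # eight "NA" placeholders
```

Because the helper returns a fixed-length list in both cases, every row keeps the same eleven columns when it is loaded into the DataFrame.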

## Uploading the JSON Files to Azure Blob Storage

In [10]:
# Connection parameters for uploading JSON files to blob storage
container_client_upload = ContainerClient.from_connection_string(conn_str=CONNECTION_STRING, container_name=CONTAINER_NAME)

# Upload JSON files from the local folder *output_json* to the container named by CONTAINER_NAME
for pth, dirs, files in os.walk(output_path):
    for filename in files:
        if filename in ('.amlignore', '.amlignore.amltmp'):
            continue
        # build the path from the directory os.walk is currently visiting
        local_file = os.path.join(pth, filename)
        with open(local_file, 'rb') as json_file:
            container_client_upload.upload_blob(name=filename, data=json_file, overwrite=True)
        # delete the local copy only after the file handle is closed
        os.remove(local_file)
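
Since `os.walk` also descends into subfolders, each file's path is most reliably built from the directory that `os.walk` reports for it. The selection logic can be sketched and tried out locally without touching Blob Storage; the generator `json_files_to_upload` and the `IGNORED` set are hypothetical names introduced for this example.

```python
import os
import tempfile

IGNORED = {".amlignore", ".amlignore.amltmp"}

def json_files_to_upload(root):
    """Yield (blob_name, local_path) for every .json file below root."""
    for dirpath, _dirs, files in os.walk(root):
        for filename in sorted(files):
            if filename in IGNORED or not filename.endswith(".json"):
                continue
            yield filename, os.path.join(dirpath, filename)

# demo in a throwaway folder with one nested file and two files that are skipped
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "nested"))
    for rel in ("a.json", ".amlignore", os.path.join("nested", "b.json"), "notes.txt"):
        with open(os.path.join(root, rel), "w") as fh:
            fh.write("{}")
    demo_names = sorted(name for name, _path in json_files_to_upload(root))
    print(demo_names)
```

Each yielded pair could then be passed to `upload_blob(name=..., data=...)` inside a `with open(local_path, "rb")` block.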
