# Information Extraction from Account Opening Forms - Automatic Form Recognition
<h3><span style="color: #117d30;"> Using Azure Form Recognizer</span></h3>

## Disclaimer

By accessing this code, you acknowledge the code is made available for presentation and demonstration purposes only and that the code (1) is not subject to SOC 1 and SOC 2 compliance audits, and (2) is not designed or intended to be a substitute for the professional advice, diagnosis, treatment, or judgment of a certified financial services professional. Do not use this code to replace, substitute, or provide professional financial advice, or judgement. You are solely responsible for ensuring the regulatory, legal, and/or contractual compliance of any use of the code, including obtaining any authorizations or consents, and any solution you choose to build that incorporates this code in whole or in part.

© 2021 Microsoft Corporation. All rights reserved

## Azure Form Recognizer

Azure Form Recognizer is a cognitive service that uses machine learning technology to identify and extract key-value pairs and table data from form documents. It then outputs structured data that includes the relationships in the original file.
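
To make "structured data" concrete, the sketch below shows the approximate shape of a custom-model result as this notebook reads it later (`analyzeResult` → `documentResults` → `fields`). The field names and values are invented for illustration, `field_texts` is a hypothetical helper, and a real response carries more keys (for example bounding boxes and page numbers).

```python
# Illustrative response only: the field names and values are made up.
sample_response = {
    "status": "succeeded",
    "analyzeResult": {
        "documentResults": [
            {
                "fields": {
                    "CompanyName": {"type": "string", "text": "Contoso Ltd"},
                    "AccountType_Current": {"type": "selectionMark", "text": "selected"},
                    "Signature": None,  # a labelled field that was not found on the form
                }
            }
        ]
    },
}


def field_texts(response):
    """Return {field name: extracted text} for every field that was found."""
    fields = response["analyzeResult"]["documentResults"][0]["fields"]
    return {name: value["text"] for name, value in fields.items() if value is not None}


print(field_texts(sample_response))
```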

## Scenario Overview



Account Opening Form Dataset: Raw unstructured data is fed into the pipeline in the form of electronically generated PDFs. This data consists of forms submitted at the time of opening a corporate account with the bank.

### Notebook Organization

- Fetch the account opening PDF files from a container in an Azure Storage account.
- Convert the PDF files to JSON by querying the custom-trained Azure Form Recognizer model through its REST API.
- Preprocess the JSON results to keep only the relevant information.
- Push the JSON files to a container in an Azure Storage account.


## Importing required libraries

In [1]:
import json
import os
import pprint
import shutil
import time
from os import listdir
from os.path import isfile, join

import requests
from azure.storage.blob import ContainerClient

In [2]:
import os
os.getcwd()

'/mnt/batch/tasks/shared/LS_root/mounts/clusters/fsi-compute-prod/code/Users/demo-fsi-user/notebooks'

## Creating local directories

In [3]:
# Create local directories if they don't exist

# *input_forms* contains all the PDF files
input_path = os.path.join(os.getcwd(), "input_forms")
# *output_json* will contain all the converted JSON files
output_path = os.path.join(os.getcwd(), "output_json")

# exist_ok=True makes these calls a no-op when the directory is already there
os.makedirs(input_path, exist_ok=True)
os.makedirs(output_path, exist_ok=True)

## Establishing connection to Azure blob storage

In [4]:
import GlobalVariables

In [5]:
CONNECTION_STRING = GlobalVariables.STORAGE_ACCOUNT_CONNECTION_STRING
CONTAINER_NAME = GlobalVariables.ACC_OPEN_CONTAINER_NAME

# Create a container client and list the blobs inside the input container

container_client = ContainerClient.from_connection_string(conn_str=CONNECTION_STRING, container_name=CONTAINER_NAME)
blobs_list = container_client.list_blobs()

# Lists of blob clients and blob names, filled in below
blob_client_list=[]
blob_file_list = []

# Get a blob client for each blob and record it together with the blob name
for c in blobs_list:
    blob_client = container_client.get_blob_client(c)
    blob_file_list.append(c.name)
    blob_client_list.append(blob_client)

# Download every blob into the local *input_forms* directory
for filename, blob_client in zip(blob_file_list, blob_client_list):
    fname = os.path.join(input_path,filename)
    with open(fname, "wb") as blob_file:
        download_stream = blob_client.download_blob()
        download_stream.readinto(blob_file)
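
The download loop above joins each blob name directly onto the local folder. If a blob name contains `/` (a virtual folder), that path points into a subdirectory that does not exist locally. One possible workaround, sketched here with a hypothetical helper `local_path_for_blob`, is to keep only the file component of the name. This assumes file names are unique across virtual folders; otherwise later downloads would overwrite earlier ones.

```python
import os
import posixpath


def local_path_for_blob(download_dir, blob_name):
    """Map a blob name to a flat local path inside download_dir."""
    # Virtual folders in blob names are conventionally separated by '/'
    file_name = posixpath.basename(blob_name)
    if not file_name:
        raise ValueError("blob name has no file component: %r" % blob_name)
    return os.path.join(download_dir, file_name)


print(local_path_for_blob("input_forms", "2021/june/Final Test Form 1.pdf"))
```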


## Running Azure Form Recognizer service on forms

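We now send the forms downloaded from Azure Blob Storage to the Form Recognizer service. Analysis is asynchronous: the POST request returns HTTP 202 together with an `Operation-Location` URL, and the cell polls that URL until the reported status is `succeeded` or `failed`.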
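
The wait-and-retry part of the next cell can be read as a small, reusable pattern: ask for the status, stop on `succeeded` or `failed`, otherwise sleep and double the wait up to a cap. The sketch below is one way to factor it out, not the notebook's own code. `poll_until_done` and its `fetch_status` argument are hypothetical names, and the fake responses stand in for the real GET request so the example runs offline.

```python
import time


def poll_until_done(fetch_status, n_tries=10, wait_sec=5, max_wait_sec=60, sleep=time.sleep):
    """Call fetch_status() until it reports 'succeeded' or 'failed'.

    fetch_status is a caller-supplied function that returns the decoded JSON body.
    """
    for _ in range(n_tries):
        body = fetch_status()
        if body["status"] == "succeeded":
            return body
        if body["status"] == "failed":
            raise RuntimeError("analysis failed: %r" % body)
        # Still running: wait, then double the delay up to the cap
        sleep(wait_sec)
        wait_sec = min(2 * wait_sec, max_wait_sec)
    raise TimeoutError("analysis still running after %d tries" % n_tries)


# Offline demonstration with canned responses; the waits are recorded, not slept
responses = iter([{"status": "running"}, {"status": "running"}, {"status": "succeeded"}])
waits = []
result = poll_until_done(lambda: next(responses), sleep=waits.append)
print(result["status"], waits)  # succeeded [5, 10]
```

Raising `TimeoutError` when the tries run out makes a stalled analysis visible instead of silently moving on to the next file.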

In [6]:
%%time
files = [f for f in listdir(input_path) if isfile(join(input_path, f))]

# Endpoint parameters for querying the custom-trained Form Recognizer model.
# Each file is submitted on its own and the analysis is returned as JSON.
endpoint = GlobalVariables.FORM_RECOGNIZER_ENDPOINT
apim_key = GlobalVariables.ACC_OPEN_API_KEY
model_id = "ff158ae7-e3a6-40b3-abd6-a24e9a0e83e3"
post_url = endpoint + "/formrecognizer/v2.1-preview.3/custom/models/%s/analyze" % model_id
params = {"includeTextDetails": True}
# The Content-Type header must match the file being sent ('image/jpg' is not a valid MIME type)
content_types = {'.pdf': 'application/pdf', '.png': 'image/png', '.jpg': 'image/jpeg', '.jpeg': 'image/jpeg'}

local_path = input_path


for file in files:
    extension = os.path.splitext(file)[1].lower()
    if extension not in content_types:
        continue
    headers = {'Content-Type': content_types[extension], 'Ocp-Apim-Subscription-Key': apim_key}

    with open(os.path.join(local_path,file), "rb") as f:
        data_bytes = f.read()

    try:
        resp = requests.post(url = post_url, data = data_bytes, headers = headers, params = params)
        print('resp',resp)
        if resp.status_code != 202:
            print("POST analyze failed:\n%s" % json.dumps(resp.json()))
            # Stop the cell; unlike quit(), this leaves the kernel running
            raise SystemExit(1)
        print("POST analyze succeeded:\n%s" % resp.headers)
        get_url = resp.headers["operation-location"]
    except Exception as e:
        print("POST analyze failed:\n%s" % str(e))
        raise SystemExit(1)
     
    n_tries = 10
    n_try = 0
    wait_sec = 5
    max_wait_sec = 60
    # Name of the form without its extension (works for '.pdf' as well as '.jpeg')
    form_name = os.path.splitext(file)[0]
    while n_try < n_tries:
        try:
            resp = requests.get(url = get_url, headers = {"Ocp-Apim-Subscription-Key": apim_key})
            resp_json = resp.json()
            if resp.status_code != 200:
                print("GET analyze results failed:\n%s" % json.dumps(resp_json))
                raise SystemExit(1)
            status = resp_json["status"]

            if status == "succeeded":
                print("Analysis succeeded:\n%s \n" % form_name)

                form_inputs = resp_json['analyzeResult']['documentResults'][0]['fields']

                # Keep only the extracted text of each labelled field
                temp = {}
                for tag, value in form_inputs.items():
                    if value is None:
                        continue
                    if value['type'] == 'selectionMark':
                        # Selection-mark tags are split on '_': first part is the field, last part the option
                        if value['text'] == 'selected':
                            field = tag.split('_')
                            temp.setdefault(field[0], []).append(field[-1])
                    else:
                        temp[tag] = value['text']

                # Write the file once, after every field has been processed
                with open(os.path.join(output_path, form_name + ".json"), 'w') as outfile:
                    json.dump(temp, outfile)
                break
            if status == "failed":
                print("Analysis failed:\n%s" % json.dumps(resp_json))
                raise SystemExit(1)
            # Analysis still running. Wait and retry.
            time.sleep(wait_sec)
            n_try += 1
            wait_sec = min(2*wait_sec, max_wait_sec)
        except Exception as e:
            msg = "GET analyze results failed:\n%s" % str(e)
            print(msg)
            raise SystemExit(1)
    else:
        # The loop ran out of tries without a final status
        print("Analysis did not finish after %d tries:\n%s" % (n_tries, form_name))

resp <Response [202]>
POST analyze succeeded:
{'Content-Length': '0', 'Operation-Location': 'https://fsiformrecognizerprod.cognitiveservices.azure.com/formrecognizer/v2.1-preview.3/custom/models/ff158ae7-e3a6-40b3-abd6-a24e9a0e83e3/analyzeresults/7baa3a53-0562-40e9-8899-1258f975da21', 'x-envoy-upstream-service-time': '362', 'apim-request-id': '992c7827-716d-4a46-b63c-a33a88f2286c', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'x-content-type-options': 'nosniff', 'Date': 'Mon, 28 Jun 2021 23:40:12 GMT'}
Analysis succeeded:
Final Test Form 1 

CPU times: user 73.7 ms, sys: 12.5 ms, total: 86.3 ms
Wall time: 20.3 s
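
The preprocessing step in the cell above keeps the text of ordinary fields and groups selection marks by field name: a tag is split on `_`, the first part is treated as the field and the last part as the chosen option. The sketch below isolates that rule in a hypothetical function, `collapse_fields`, and runs it on invented field data so the effect is easy to see.

```python
def collapse_fields(fields):
    """Flatten Form Recognizer fields into {name: text or list of selected options}."""
    result = {}
    for tag, value in fields.items():
        if value is None:
            # Labelled field that was not found on this form
            continue
        if value["type"] == "selectionMark":
            if value["text"] == "selected":
                parts = tag.split("_")
                result.setdefault(parts[0], []).append(parts[-1])
        else:
            result[tag] = value["text"]
    return result


# Invented example data, not taken from a real form
example = {
    "CompanyName": {"type": "string", "text": "Contoso Ltd"},
    "Services_Loans": {"type": "selectionMark", "text": "selected"},
    "Services_Cards": {"type": "selectionMark", "text": "unselected"},
    "Services_Payroll": {"type": "selectionMark", "text": "selected"},
    "Signature": None,
}
collapsed = collapse_fields(example)
print(collapsed)  # {'CompanyName': 'Contoso Ltd', 'Services': ['Loans', 'Payroll']}
```

Note that this rule relies on field names containing no underscore of their own; a tag such as `Account_Type_Current` would be grouped under `Account`.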





## Connection parameters for uploading to Azure blob storage

In [7]:
# Connection parameters for uploading JSON files to blob storage
CONNECTION_STRING = GlobalVariables.STORAGE_ACCOUNT_CONNECTION_STRING
CONTAINER_NAME = GlobalVariables.ACC_OPEN_OUTPUT_CONTAINER

container_client_upload = ContainerClient.from_connection_string(conn_str=CONNECTION_STRING, container_name=CONTAINER_NAME)


## Uploading JSON files to Azure blob storage

In [8]:
# Upload JSON files from local folder *output_json* to the container *azure-ml-output-accountopeningforms*

for pth, dirs, files in os.walk(output_path):
    for filename in files:
        if not filename.lower().endswith('.json'):
            continue

        # Join with the directory currently being walked, not the top-level folder
        with open(os.path.join(pth, filename), 'rb') as json_file:
            container_client_upload.upload_blob(name=filename, data=json_file, overwrite=True)
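
Because `os.walk` also descends into subfolders, two files with the same name in different folders would be uploaded under the same blob name. A possible refinement, sketched here with a hypothetical generator `json_uploads`, derives the blob name from the path relative to the output folder. The demonstration uses a temporary directory with made-up file names rather than the notebook's real output.

```python
import os
import tempfile


def json_uploads(root):
    """Yield (local path, blob name) for every .json file under root."""
    for folder, _dirs, files in os.walk(root):
        for filename in sorted(files):
            if filename.lower().endswith(".json"):
                local = os.path.join(folder, filename)
                # Blob names use '/' regardless of the local path separator
                blob_name = os.path.relpath(local, root).replace(os.sep, "/")
                yield local, blob_name


# Offline demonstration on a throwaway directory tree
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "batch1"))
    for rel in ("form1.json", os.path.join("batch1", "form2.json"), "notes.txt"):
        with open(os.path.join(root, rel), "w") as fh:
            fh.write("{}")
    names = sorted(blob for _local, blob in json_uploads(root))
print(names)  # ['batch1/form2.json', 'form1.json']
```

Each pair could then be passed to `container_client_upload.upload_blob(name=blob_name, data=..., overwrite=True)` in place of the bare file name.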
    