# Information Extraction from Documents for a Bank - Automatic Form Recognition 
<h3><span style="color: #117d30;"> Using Azure Form Recognizer</span></h3>

## Azure Form Recognizer

Azure Form Recognizer is a cognitive service that uses machine learning technology to identify and extract key-value pairs and table data from form documents. It then outputs structured data that includes the relationships in the original file.

## Scenario Overview



Bank Incidents Form Dataset: Raw unstructured data is fed into the pipeline in the form of electronically generated PDFs. These reports contain information about injuries that occurred at different bank locations.

### Notebook Organization

- Fetch the bank incident PDF files from a container in an Azure storage account.
- Convert the PDF files to JSON by querying the custom-trained Azure Form Recognizer model through the REST API.
- Preprocess the JSON files to extract only the relevant information.
- Push the JSON files to a container in an Azure storage account.
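
The preprocessing step is only named in this list, so here is a minimal sketch of what it could look like. It assumes the v2.1 custom-model response layout, where models trained with labels return named fields under `analyzeResult.documentResults` and models trained without labels return key/value pairs under `analyzeResult.pageResults`. The function name `extract_fields` and the sample data are hypothetical and not part of the original notebook.

```python
def extract_fields(result):
    """Flatten a Form Recognizer analyze response into {field name: text}."""
    analyze_result = result.get("analyzeResult", {})
    fields = {}

    # Models trained with labels: named fields under documentResults
    for document in analyze_result.get("documentResults", []):
        for name, field in document.get("fields", {}).items():
            # A field the model could not find may come back as null
            fields[name] = field.get("text") if field else None

    # Models trained without labels: key/value pairs under pageResults
    for page in analyze_result.get("pageResults", []):
        for pair in page.get("keyValuePairs", []):
            key = (pair.get("key") or {}).get("text")
            if key is not None:
                fields[key] = (pair.get("value") or {}).get("text")

    return fields


# Hypothetical response fragment, for illustration only
sample = {
    "status": "succeeded",
    "analyzeResult": {
        "documentResults": [
            {"fields": {"Location": {"type": "string", "text": "Branch A"}, "Date": None}}
        ]
    },
}
print(extract_fields(sample))
```

Keeping only the field text drops bounding boxes and confidence scores; if downstream consumers need those, the same loops can copy them across instead.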

## Disclaimer

By accessing this code, you acknowledge the code is made available for presentation and demonstration purposes only and that the code (1) is not subject to SOC 1 and SOC 2 compliance audits, and (2) is not designed or intended to be a substitute for the professional advice, diagnosis, treatment, or judgment of a certified financial services professional. Do not use this code to replace, substitute, or provide professional financial advice, or judgement. You are solely responsible for ensuring the regulatory, legal, and/or contractual compliance of any use of the code, including obtaining any authorizations or consents, and any solution you choose to build that incorporates this code in whole or in part.

© 2021 Microsoft Corporation. All rights reserved

## Importing required libraries

In [1]:
import json
import os
import pprint
import shutil
import time
from os import listdir
from os.path import isfile, join

import requests
from azure.storage.blob import ContainerClient

In [2]:
import os
os.getcwd()

'/mnt/batch/tasks/shared/LS_root/mounts/clusters/demo-fsi-user/code/Users'

## Creating local directories

In [3]:
# Create the local working directories if they don't exist

# *incident_forms* holds the downloaded PDF files
input_path = os.path.join(os.getcwd(), "incident_forms")
# *incident_jsons* will hold the converted JSON files
output_path = os.path.join(os.getcwd(), "incident_jsons")

os.makedirs(input_path, exist_ok=True)
os.makedirs(output_path, exist_ok=True)

## Establishing connection to Azure blob storage

In [4]:
import GlobalVariables

In [5]:
CONNECTION_STRING = GlobalVariables.STORAGE_ACCOUNT_CONNECTION_STRING_INCIDENT_FORMS
CONTAINER_NAME = GlobalVariables.INCIDENT_CONTAINER_NAME

# Create a container client and list the blobs in the input container

container_client = ContainerClient.from_connection_string(conn_str=CONNECTION_STRING, container_name=CONTAINER_NAME)
blobs_list = container_client.list_blobs()

# Lists of blob names and blob clients, consumed by the download loop below
blob_client_list = []
blob_file_list = []

# Get a blob client for each blob and record it together with the blob name
for c in blobs_list:
    blob_client = container_client.get_blob_client(c)
    blob_file_list.append(c.name)
    blob_client_list.append(blob_client)

# Download every blob into the local *incident_forms* directory
for filename, blob_client in zip(blob_file_list, blob_client_list):
    fname = os.path.join(input_path,filename)
    with open(fname, "wb") as blob_file:
        download_stream = blob_client.download_blob()
        download_stream.readinto(blob_file)
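
Blob names may contain `/` (virtual folders), in which case opening `os.path.join(input_path, filename)` fails because the parent folder does not exist locally. The helper below is a sketch of one way to handle that; `local_path_for_blob` is a hypothetical name, not part of the original notebook or of the Azure SDK.

```python
import os


def local_path_for_blob(base_dir, blob_name):
    """Return a local path for blob_name under base_dir, creating parent folders."""
    base_dir = os.path.abspath(base_dir)
    target = os.path.abspath(os.path.join(base_dir, *blob_name.split("/")))
    # Refuse names that would resolve outside base_dir (for example "../x.pdf")
    if os.path.commonpath([base_dir, target]) != base_dir:
        raise ValueError("Blob name escapes the download directory: %r" % blob_name)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    return target
```

In the download loop, `fname = local_path_for_blob(input_path, filename)` would take the place of the `os.path.join` call.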

## Running Azure Form Recognizer service on forms

In [6]:
%%time
files = [f for f in listdir(input_path) if isfile(join(input_path, f))]

# Endpoint parameters for querying the custom-trained Form Recognizer model
# Each file is processed one at a time and its analysis result is saved as JSON
endpoint = GlobalVariables.FORM_RECOGNIZER_ENDPOINT
apim_key = GlobalVariables.FORM_RECOGNIZER_API_KEY
model_id = GlobalVariables.INCIDENT_FORM_RECOGNIZER_MODEL_ID
post_url = endpoint + "/formrecognizer/v2.1-preview.3/custom/models/%s/analyze" % model_id
params = {"includeTextDetails": True}

# Content type to send for each supported file extension
content_types = {'.pdf': 'application/pdf', '.png': 'image/png', '.jpg': 'image/jpeg', '.jpeg': 'image/jpeg'}

for file in files:
    name, ext = os.path.splitext(file)
    if ext.lower() not in content_types:
        continue

    with open(os.path.join(input_path, file), "rb") as f:
        data_bytes = f.read()

    # Submit the document; the service replies with 202 and a URL to poll for the result
    headers = {'Content-Type': content_types[ext.lower()], 'Ocp-Apim-Subscription-Key': apim_key}
    resp = requests.post(url=post_url, data=data_bytes, headers=headers, params=params)
    print('resp', resp)
    if resp.status_code != 202:
        # Raise instead of calling quit(), which would shut down the notebook kernel
        raise RuntimeError("POST analyze failed:\n%s" % resp.text)
    print("POST analyze succeeded:\n%s" % resp.headers)
    get_url = resp.headers["operation-location"]
    print(get_url)

    # Poll every 10 seconds until the analysis reaches a terminal state
    for attempt in range(30):
        time.sleep(10)
        resp = requests.get(url=get_url, headers={"Ocp-Apim-Subscription-Key": apim_key})
        resp_json = resp.json()
        status = resp_json.get("status")
        if status in ("succeeded", "failed"):
            break
    if status != "succeeded":
        raise RuntimeError("Analysis of %s did not succeed (status: %s)" % (file, status))

    # Use the base name so that .jpeg files are not mangled by fixed-width slicing
    with open(os.path.join(output_path, name + ".json"), 'w') as outfile:
        json.dump(resp_json, outfile)

resp <Response [202]>
POST analyze succeeded:
{'Content-Length': '0', 'Operation-Location': 'https://westus2.cognitiveservices.azure.com/formrecognizer/v2.1-preview.3/custom/models/54527eb9-97b9-4755-9c32-1ca96f8815b0/analyzeresults/9b3539e5-4bc2-45a4-83ac-e043bda8062d', 'x-envoy-upstream-service-time': '83', 'apim-request-id': '45634693-05ad-4639-a892-0e571d62dab7', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'x-content-type-options': 'nosniff', 'Date': 'Tue, 03 Aug 2021 01:16:12 GMT'}
https://westus2.cognitiveservices.azure.com/formrecognizer/v2.1-preview.3/custom/models/54527eb9-97b9-4755-9c32-1ca96f8815b0/analyzeresults/9b3539e5-4bc2-45a4-83ac-e043bda8062d
resp <Response [202]>
POST analyze succeeded:
{'Content-Length': '0', 'Operation-Location': 'https://westus2.cognitiveservices.azure.com/formrecognizer/v2.1-preview.3/custom/models/54527eb9-97b9-4755-9c32-1ca96f8815b0/analyzeresults/b9798cd7-a2b7-4603-9e36-f04223810e52', 'x-envoy-upstream-service-
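
A GET issued shortly after submission can return a response whose `status` is still `notStarted` or `running`, so it can be worth checking the saved files before uploading them. The sketch below scans a folder of saved responses and reports any that are not marked `succeeded`; the function name `find_unfinished` is hypothetical.

```python
import json
import os


def find_unfinished(json_dir):
    """Return the names of saved results whose status is not "succeeded"."""
    unfinished = []
    for name in sorted(os.listdir(json_dir)):
        if not name.lower().endswith(".json"):
            continue
        with open(os.path.join(json_dir, name)) as f:
            result = json.load(f)
        if result.get("status") != "succeeded":
            unfinished.append(name)
    return unfinished
```

Calling `find_unfinished(output_path)` after the analysis cell gives the list of forms that would need to be fetched again or resubmitted.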




## Connection parameters for uploading to Azure blob storage

In [7]:
# Connection parameters for uploading JSON files to blob storage
CONNECTION_STRING = GlobalVariables.STORAGE_ACCOUNT_CONNECTION_STRING
CONTAINER_NAME = GlobalVariables.OUTPUT_CONTAINER_NAME

container_client_upload = ContainerClient.from_connection_string(conn_str=CONNECTION_STRING, container_name=CONTAINER_NAME)


## Uploading JSON files to Azure blob storage

In [8]:
# Upload JSON files from the local folder *incident_jsons* to the output container *formrecogoutput*

for pth, dirs, files in os.walk(output_path):
    for filename in files:
        # Join with the directory currently being walked so files in subfolders are found
        with open(os.path.join(pth, filename), 'rb') as json_file:
            blob_client = container_client_upload.upload_blob(name=filename, data=json_file, overwrite=True)
    

ResourceNotFoundError: The specified container does not exist.
RequestId:b193ab76-d01e-0046-6705-88dcf2000000
Time:2021-08-03T01:18:09.5469188Z
ErrorCode:ContainerNotFound
Error:None
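
The `ContainerNotFound` error above indicates that the container named by `GlobalVariables.OUTPUT_CONTAINER_NAME` does not exist in the storage account. One way to guard against this is to create the container before uploading. The sketch below assumes the installed `azure-storage-blob` 12.x release provides `ContainerClient.exists()` and `ContainerClient.create_container()`; the helper name `ensure_container` is hypothetical.

```python
def ensure_container(container_client):
    """Create the container behind container_client if it does not exist yet.

    Returns True if the container was created, False if it already existed.
    """
    # Assumption: the client exposes exists() and create_container(),
    # as azure.storage.blob.ContainerClient does in recent 12.x releases
    if container_client.exists():
        return False
    container_client.create_container()
    return True


# Intended use in this notebook, before the upload loop:
# ensure_container(container_client_upload)
```

Whether creating the container automatically is appropriate depends on how the storage account is managed; if containers are provisioned separately, correcting the configured container name may be the better fix.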