### Step 1 - Read the Information from the Blobs

In this step, we read the results that Azure Document Intelligence produced for each document. The results are stored as JSON files in Azure Blob Storage, and it is our job to process and organize them. Some of the data will be structured into tables, while other data will be kept as text. Organizing the extracted information this way prepares it for further analysis and use.

In [55]:
import os
import json
from azure.storage.blob import BlobServiceClient
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

def read_json_files_from_blob(folder_path):
    # Retrieve the connection string from the environment variables
    connection_string = os.getenv('connection_string')

    # Ensure the connection string is not None
    if connection_string is None:
        raise ValueError("The connection string environment variable is not set.")

    # Create a BlobServiceClient
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)

    # Get the container client
    container_client = blob_service_client.get_container_client("bankdetail")

    # List all blobs in the specified folder
    blob_list = container_client.list_blobs(name_starts_with=folder_path)

    # Keep only the JSON files and collect their parsed contents
    documents = []
    for blob in blob_list:
        if blob.name.endswith('.json'):
            blob_client = container_client.get_blob_client(blob.name)
            blob_data = blob_client.download_blob().readall()
            documents.append(json.loads(blob_data))

    # Return the parsed documents so that callers can process them
    return documents
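
The prefix filter and the `.json` check can be tried without an Azure account. The following is a minimal sketch, assuming an in-memory dictionary (`fake_container`, a hypothetical stand-in for the `bankdetail` container) that maps made-up blob names to raw bytes. It mirrors the selection logic of `read_json_files_from_blob` and shows that a prefix such as `"loanform"` is matched as a plain string, so it would also pick up a sibling folder such as `loanform_archive/`.

```python
import json

# Hypothetical stand-in for a blob container: blob name -> raw bytes.
fake_container = {
    "loanform/form1.pdf.json": b'{"pages": []}',
    "loanform/form1.pdf": b"%PDF-",
    "loanform_archive/old.json": b'{"pages": [{"lines": []}]}',
    "paystubs/stub1.json": b'{"paragraphs": []}',
}

def select_json_blobs(container, prefix):
    """Return {blob name: parsed JSON} for JSON blobs whose name starts with prefix."""
    selected = {}
    for name, raw in container.items():
        if name.startswith(prefix) and name.endswith(".json"):
            selected[name] = json.loads(raw)
    return selected

loose = select_json_blobs(fake_container, "loanform")    # prefix without a slash
strict = select_json_blobs(fake_container, "loanform/")  # prefix limited to one folder
print(sorted(loose))
print(sorted(strict))
```

Passing the folder name with a trailing slash (for example `"loanform/"`) keeps the match limited to that folder.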

#### Loan Agreements

In [56]:
loanagreement = read_json_files_from_blob("loanagreements")

#### Loan Forms

In [57]:
loanform = read_json_files_from_blob("loanform")

#### Pay Stubs

In [58]:
paystubs = read_json_files_from_blob("paystubs")

### Step 3 - Data Structuring


In this step, you will take the JSON data read from Azure Blob Storage, keep only its text content, and remove unnecessary formatting such as newlines and extra spaces. Follow the instructions below to complete this step.
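
As a small illustration of the whitespace cleanup described above, the sketch below collapses newlines, tabs, and repeated spaces into single spaces. The helper name `normalize_whitespace` and the sample string are assumptions made for this example; they are not part of the notebook.

```python
def normalize_whitespace(text):
    """Collapse any run of whitespace (newlines, tabs, spaces) into one space."""
    # str.split() with no arguments splits on whitespace runs and drops empty pieces.
    return " ".join(text.split())

sample = "  LOAN   AGREEMENT\n\nThis agreement is made\ton the date below.  "
normalized = normalize_whitespace(sample)
print(normalized)
```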

#### Loan Agreements

In [44]:
def clean_json_data(json_data):
    # Accept a single parsed document or a list of parsed documents
    documents = json_data if isinstance(json_data, list) else [json_data]

    # Extract relevant text content from the JSON
    content = []
    for document in documents:
        # Extract text from paragraphs
        for paragraph in document.get("paragraphs", []):
            content.append(paragraph.get("text", "").strip())

        # Extract text from pages and lines
        for page in document.get("pages", []):
            for line in page.get("lines", []):
                content.append(line.get("text", "").strip())

    # Join all non-empty text into a single string, separated by single spaces
    # so that adjacent words are not fused together
    plain_text_content = " ".join(text for text in content if text)

    return plain_text_content
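
When a result contains both `paragraphs` and page `lines`, collecting from both can include the same text twice. The sketch below is one possible way to avoid that, assuming the same `text` keys used in `clean_json_data`. The helper `extract_text_once` and the sample document are hypothetical and only illustrate the idea.

```python
def extract_text_once(json_data):
    """Prefer paragraph text; fall back to page lines only when no paragraphs exist."""
    texts = [p.get("text", "").strip() for p in json_data.get("paragraphs", [])]
    if not texts:
        for page in json_data.get("pages", []):
            texts.extend(line.get("text", "").strip() for line in page.get("lines", []))
    # Drop empty entries and join with single spaces so words stay separated.
    return " ".join(t for t in texts if t)

# Hypothetical sample in which paragraphs and lines carry the same text.
sample_doc = {
    "paragraphs": [{"text": "Loan Agreement"}, {"text": "Borrower: Jane Doe"}],
    "pages": [{"lines": [{"text": "Loan Agreement"}, {"text": "Borrower: Jane Doe"}]}],
}
from_paragraphs = extract_text_once(sample_doc)
from_lines = extract_text_once({"pages": sample_doc["pages"]})
print(from_paragraphs)
print(from_lines)
```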

In [None]:
# Clean the JSON data
cleaned_data = clean_json_data(loanagreement)

# Print the cleaned data
print(json.dumps(cleaned_data, indent=2))

#### Loan Form

In [53]:
def get_key_value_pairs(result):
    # Group key-value pairs by the page on which each key appears.
    # Note: this expects a result object with attribute access
    # (result.key_value_pairs), not a parsed JSON dictionary.
    pagekvp = {}
    for kv_pair in result.key_value_pairs:
        # Skip pairs that are missing a key or a value
        if not kv_pair.key or not kv_pair.value:
            continue
        pagenum = kv_pair.key.bounding_regions[0].page_number
        # Create the per-page dictionary on first use, then store the pair
        pagekvp.setdefault(pagenum, {})[kv_pair.key.content] = kv_pair.value.content
    return pagekvp
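
The function above reads its input through attributes (`result.key_value_pairs`, `key.bounding_regions[0].page_number`), so it needs a result object and not a parsed JSON dictionary. The sketch below builds hypothetical stand-in objects with `SimpleNamespace` to show that input shape and the per-page dictionary that this kind of grouping produces. `make_field`, `group_pairs_by_page`, and the sample values are assumptions for illustration only.

```python
from types import SimpleNamespace

def make_field(content, page_number):
    """Build a hypothetical stand-in for a key or value element."""
    region = SimpleNamespace(page_number=page_number)
    return SimpleNamespace(content=content, bounding_regions=[region])

def group_pairs_by_page(result):
    """Group key-value pairs by the page number of each key."""
    pages = {}
    for pair in result.key_value_pairs:
        if not pair.key or not pair.value:
            continue  # skip incomplete pairs
        page = pair.key.bounding_regions[0].page_number
        pages.setdefault(page, {})[pair.key.content] = pair.value.content
    return pages

# Made-up sample: two complete pairs and one pair without a value.
fake_result = SimpleNamespace(key_value_pairs=[
    SimpleNamespace(key=make_field("Name", 1), value=make_field("Jane Doe", 1)),
    SimpleNamespace(key=make_field("Signature", 1), value=None),
    SimpleNamespace(key=make_field("Loan Amount", 2), value=make_field("25,000", 2)),
])
grouped = group_pairs_by_page(fake_result)
print(grouped)
```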

In [54]:
cleaned_data = clean_json_data(loanform)
print(json.dumps(cleaned_data, indent=2))
