# Document Processing with Azure

This notebook demonstrates how to upload documents to Azure Blob Storage, analyze them with Azure Document Intelligence, and retrieve the extracted information in JSON format. The steps are:

1. **Upload Data to Blob Storage**: Upload the documents to Azure Blob Storage for processing.
2. **Process Data with Azure Document Intelligence**: Use Azure Document Intelligence to analyze the uploaded documents and extract information from them.
3. **Retrieve Information in JSON Format**: Retrieve the processed information in JSON format for further use.
4. **Structure the Retrieved Data**: Organize the extracted JSON into tables and text.

## Importance of Document Processing

Automating document processing improves efficiency and accuracy when handling large volumes of documents. With Azure's cloud services, organizations can streamline their workflows, reduce manual data-entry errors, and make the contents of their documents available for search and analysis. This saves time and resources and supports better-informed decisions.

### Step 1 - Upload Data to Blob Storage

In this step, we will upload our documents to Azure Blob Storage, which serves as a scalable and secure storage solution for our data. The data used is stored in this repo's *data* folder.

In [19]:
import os
from dotenv import load_dotenv
from azure.storage.blob import BlobServiceClient

# Load environment variables from .env file
load_dotenv()

# Retrieve the connection string, data folder, and container name from the environment variables
connection_string = os.getenv('connection_string')
data_folder = os.getenv('data_folder')
container_name = os.getenv('container_name')

# Ensure the connection string, data folder, and container name are not None
if connection_string is None:
    raise ValueError("The connection string environment variable is not set.")
if data_folder is None:
    raise ValueError("The data folder environment variable is not set.")
if container_name is None:
    raise ValueError("The container name environment variable is not set.")

# Ensure the data folder exists
if not os.path.isdir(data_folder):
    raise FileNotFoundError(f"The specified data folder does not exist: {data_folder}")

# Create a BlobServiceClient
blob_service_client = BlobServiceClient.from_connection_string(connection_string)

# Upload files in the data folder and its subdirectories to the blob container
for root, dirs, files in os.walk(data_folder):
    for filename in files:
        file_path = os.path.join(root, filename)
        if os.path.isfile(file_path):
            # Create a blob path that maintains the directory structure
            blob_path = os.path.relpath(file_path, data_folder).replace("\\", "/")
            blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_path)
            with open(file_path, "rb") as data:
                blob_client.upload_blob(data, overwrite=True)
            print(f"Uploaded {blob_path} to blob storage.")

Uploaded readme.md to blob storage.
Uploaded loanagreements/la_janesmith.pdf to blob storage.
Uploaded loanform/lpjohndoe.pdf to blob storage.
Uploaded loanform/lp_janesmith.pdf to blob storage.
Uploaded paystubs/paystubjanesmith.pdf to blob storage.
Uploaded paystubs/paystubjohndoe.pdf to blob storage.
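
The blob names in the output above mirror the local folder layout. As a minimal sketch of that mapping in isolation (the helper name `to_blob_path` is hypothetical and not part of the notebook), the relative path is computed with `os.path.relpath` and Windows separators are normalized to `/`:

```python
import os


def to_blob_path(file_path: str, data_folder: str) -> str:
    """Return the blob name for a local file, relative to data_folder.

    Hypothetical helper: it isolates the path logic used in the upload loop.
    """
    relative = os.path.relpath(file_path, data_folder)
    # Blob names use "/" as the virtual-directory separator on every platform.
    return relative.replace("\\", "/")


print(to_blob_path(os.path.join("data", "loanform", "lpjohndoe.pdf"), "data"))
# loanform/lpjohndoe.pdf
```

Keeping the sub-folder in the blob name means documents of the same type (loan forms, pay stubs) can later be listed by prefix.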


### Step 2 - Process Data with Azure Document Intelligence

In this step, we will use Azure Document Intelligence to analyze and extract information from the uploaded documents. The code demonstrates how to authenticate with Azure, submit a document for analysis, and retrieve the results, including detected languages, lines, words, and paragraphs.
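
Words and lines in the result are located by character spans (an offset and a length into the page content), and a word belongs to a line when its span lies entirely inside the line's span. The sketch below illustrates that containment test with a stand-in `Span` dataclass and sample text; both are assumptions made for illustration, not SDK types or real output:

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Stand-in for a span: a start offset and a length in characters."""
    offset: int
    length: int


def span_contains(outer: Span, inner: Span) -> bool:
    """True when inner starts and ends within outer."""
    return (
        inner.offset >= outer.offset
        and inner.offset + inner.length <= outer.offset + outer.length
    )


# Made-up page content: the first line covers characters 0..13.
content = "Loan Agreement\nJane Smith"
line = Span(offset=0, length=14)
words = [Span(0, 4), Span(5, 9), Span(15, 4), Span(20, 5)]

in_line = [
    content[w.offset:w.offset + w.length]
    for w in words
    if span_contains(line, w)
]
print(in_line)  # ['Loan', 'Agreement']
```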

In [39]:
import os
import json
from azure.storage.blob import BlobServiceClient, generate_blob_sas, BlobSasPermissions
from datetime import datetime, timedelta
from dotenv import find_dotenv, load_dotenv
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import DocumentAnalysisFeature, AnalyzeResult, AnalyzeDocumentRequest

def generate_sas_url(blob_service_client, container_name, blob_name, expiry_hours=1):
    """
    Generate a SAS URL for a blob in Azure Blob Storage.

    :param blob_service_client: BlobServiceClient instance
    :param container_name: Name of the container
    :param blob_name: Name of the blob
    :param expiry_hours: Expiry time in hours for the SAS token
    :return: SAS URL for the blob
    """
    sas_token = generate_blob_sas(
        account_name=blob_service_client.account_name,
        container_name=container_name,
        blob_name=blob_name,
        account_key=blob_service_client.credential.account_key,
        permission=BlobSasPermissions(read=True),
        expiry=datetime.utcnow() + timedelta(hours=expiry_hours)
    )

    sas_url = f"https://{blob_service_client.account_name}.blob.core.windows.net/{container_name}/{blob_name}?{sas_token}"
    return sas_url

def get_words(page, line):
    result = []
    for word in page.words:
        if _in_span(word, line.spans):
            result.append(word)
    return result

def _in_span(word, spans):
    for span in spans:
        if word.span.offset >= span.offset and (word.span.offset + word.span.length) <= (span.offset + span.length):
            return True
    return False

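def bounding_region_to_dict(region):
    # BoundingRegion exposes the region outline as `polygon`
    # (a flat list of x, y coordinates); it has no `bounding_box` attribute.
    return {
        "page_number": region.page_number,
        "polygon": list(region.polygon)
    }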

def analyze_read(blob_url):
    # The endpoint and key are read from environment variables (for example, set in the .env file).
    endpoint = os.environ["DOCUMENTINTELLIGENCE_ENDPOINT"]
    key = os.environ["DOCUMENTINTELLIGENCE_API_KEY"]

    document_intelligence_client = DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))

    # Analyze a document from the blob URL
    poller = document_intelligence_client.begin_analyze_document(
        "prebuilt-read",
        AnalyzeDocumentRequest(url_source=blob_url),
        features=[DocumentAnalysisFeature.LANGUAGES]
    )
    
    result: AnalyzeResult = poller.result()
    
    # Collect analysis results
    analysis_results = {
        "languages": [],
        "pages": [],
        "paragraphs": []
    }

    # Detect languages.
    if result.languages is not None:
        for language in result.languages:
            analysis_results["languages"].append({
                "locale": language.locale,
                "confidence": language.confidence
            })
    
    # Analyze pages.
    for page in result.pages:
        page_info = {
            "page_number": page.page_number,
            "width": page.width,
            "height": page.height,
            "unit": page.unit,
            "lines": []
        }

        # Analyze lines.
        if page.lines:
            for line_idx, line in enumerate(page.lines):
                words = get_words(page, line)
                line_info = {
                    "line_idx": line_idx,
                    "words": [{"content": word.content, "confidence": word.confidence} for word in words],
                    "text": line.content,
                    "polygon": line.polygon
                }
                page_info["lines"].append(line_info)
        
        analysis_results["pages"].append(page_info)
        
    # Analyze paragraphs.
    if result.paragraphs:
        for paragraph in result.paragraphs:
            analysis_results["paragraphs"].append({
                "bounding_regions": [bounding_region_to_dict(region) for region in paragraph.bounding_regions],
                "content": paragraph.content
            })

    return analysis_results

def save_analysis_results(blob_service_client, container_name, blob_name, analysis_results):
    """
    Save the analysis results to the same blob storage.

    :param blob_service_client: BlobServiceClient instance
    :param container_name: Name of the container
    :param blob_name: Name of the blob
    :param analysis_results: Analysis results to save
    """
    container_client = blob_service_client.get_container_client(container_name)
    blob_client = container_client.get_blob_client(blob_name + ".json")

    # Convert analysis results to JSON string
    analysis_results_json = json.dumps(analysis_results, indent=2)

    # Upload the JSON string to the blob
    blob_client.upload_blob(analysis_results_json, overwrite=True)

def process_blob_documents():
    # Load environment variables from .env file
    load_dotenv(find_dotenv())

    # Retrieve the connection string and container name from the environment variables
    connection_string = os.getenv('connection_string')
    container_name = os.getenv('container_name')

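    # Ensure the connection string and container name are not None
    if connection_string is None:
        raise ValueError("The connection string environment variable is not set.")
    if container_name is None:
        raise ValueError("The container name environment variable is not set.")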

    # Create a BlobServiceClient
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)

    # List and process documents in the specified container
    container_client = blob_service_client.get_container_client(container_name)
    blob_list = container_client.list_blobs()

    supported_formats = {".pdf", ".jpeg", ".jpg", ".png", ".tiff", ".bmp"}

    for blob in blob_list:
        file_extension = os.path.splitext(blob.name)[1].lower()
        
        if file_extension in supported_formats:
            sas_url = generate_sas_url(blob_service_client, container_name, blob.name)
            print(f"Processing document: {sas_url}")
            try:
                analysis_results = analyze_read(sas_url)
                save_analysis_results(blob_service_client, container_name, blob.name, analysis_results)
            except Exception as e:
                print(f"Failed to process document {blob.name}: {e}")
        else:
            print(f"Skipping unsupported file format: {blob.name}")

if __name__ == "__main__":
    from azure.core.exceptions import HttpResponseError

    try:
        process_blob_documents()
    except HttpResponseError as error:
        # Examples of how to check an HttpResponseError
        # Check by error code:
        if error.error is not None:
            if error.error.code == "InvalidImage":
                print(f"Received an invalid image error: {error.error}")
            if error.error.code == "InvalidRequest":
                print(f"Received an invalid request error: {error.error}")
            # Raise the error again after printing it
            raise
        # If the inner error is None, the message can be checked for more information:
        if "Invalid request".casefold() in error.message.casefold():
            print(f"Uh-oh! Seems there was an invalid request: {error}")
        # Raise the error again
        raise

Processing document: https://stgweaihack.blob.core.windows.net/bankdetail/loanagreements/la_janesmith.pdf?se=2024-09-02T16%3A49%3A12Z&sp=r&sv=2021-08-06&sr=b&sig=XpFmy1V2azh8mmTRzQUBlYy%2BOEafG9Zvw1SHA7O25Ms%3D
Failed to process document loanagreements/la_janesmith.pdf: 'BoundingRegion' object has no attribute 'bounding_box'
Processing document: https://stgweaihack.blob.core.windows.net/bankdetail/loanform/lp_janesmith.pdf?se=2024-09-02T16%3A49%3A18Z&sp=r&sv=2021-08-06&sr=b&sig=kXquo9ZOsiuhE8SJcXkc3jG5h1%2Bi915KCP/9Ux/Kq0c%3D
Failed to process document loanform/lp_janesmith.pdf: 'BoundingRegion' object has no attribute 'bounding_box'
Processing document: https://stgweaihack.blob.core.windows.net/bankdetail/loanform/lpjohndoe.pdf?se=2024-09-02T16%3A49%3A22Z&sp=r&sv=2021-08-06&sr=b&sig=jYfSEQSKbQv6IIYrKTyiJh4SLHN983OgxLDCuAvitEU%3D


KeyboardInterrupt: 

### Step 3 - Retrieve Information in JSON Format

In this step, we will extract and retrieve the processed information from the documents in JSON format. This will allow us to further analyze and utilize the extracted data for various purposes.
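
Because Step 2 saves each result next to its source document as `<blob name>.json`, one option is to read those blobs back instead of analyzing the documents a second time. A minimal sketch, assuming `container_client` behaves like an `azure.storage.blob.ContainerClient` (only its `list_blobs` and `download_blob` methods are used); the helper name `load_saved_results` is hypothetical:

```python
import json


def load_saved_results(container_client):
    """Return {blob_name: parsed JSON} for every .json blob in the container.

    Hypothetical helper; container_client is assumed to offer list_blobs()
    and download_blob(name).readall(), as ContainerClient does.
    """
    results = {}
    for blob in container_client.list_blobs():
        if blob.name.lower().endswith(".json"):
            raw = container_client.download_blob(blob.name).readall()
            results[blob.name] = json.loads(raw)
    return results
```

In the notebook this could be called as `load_saved_results(blob_service_client.get_container_client(container_name))`, reusing the client created in Step 1.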

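In [None]:
import os
import json
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest

# Create a DocumentIntelligenceClient
# (DocumentAnalysisClient belongs to the older azure-ai-formrecognizer package)
document_intelligence_client = DocumentIntelligenceClient(
    endpoint=os.environ["DOCUMENTINTELLIGENCE_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["DOCUMENTINTELLIGENCE_API_KEY"]),
)

# Function to analyze a document from blob storage
def analyze_document(blob_url):
    # "prebuilt-layout" is used here; the general "prebuilt-document" model
    # is tied to the older Form Recognizer API versions.
    poller = document_intelligence_client.begin_analyze_document(
        "prebuilt-layout", AnalyzeDocumentRequest(url_source=blob_url)
    )
    return poller.result()

# Retrieve and process documents from blob storage.
# Reuses blob_service_client (Step 1) and generate_sas_url (Step 2).
container_name = os.getenv("container_name")
container_client = blob_service_client.get_container_client(container_name)
supported_formats = (".pdf", ".jpeg", ".jpg", ".png", ".tiff", ".bmp")

for blob in container_client.list_blobs():
    # Skip the readme and the .json results written in Step 2
    if not blob.name.lower().endswith(supported_formats):
        continue
    # The service needs a SAS URL to read a blob from a private container
    blob_url = generate_sas_url(blob_service_client, container_name, blob.name)
    result = analyze_document(blob_url)

    # Convert result to JSON format
    print(json.dumps(result.as_dict(), indent=2))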

### Step 4 - Structure the Retrieved Data (or store the JSONs in Blob Storage?)

In this step, we will structure the data retrieved from Azure Document Intelligence. The service returns its results as JSON, and it is up to us to process and organize them. Some of the data will be structured into tables, while other data will be kept as formatted text. This step ensures that the extracted information is organized in a meaningful way for further analysis and use.
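
One possible way to do this, sketched under the assumption that the input has the shape built by `analyze_read` in Step 2 (`pages` with `lines`, plus `paragraphs`): flatten the lines into table-like rows and join the paragraphs into plain text. The function names and the `sample` values below are made up for illustration and are not results from the service.

```python
def lines_to_rows(analysis_results):
    """Flatten pages/lines into one row (dict) per line, suitable for a table."""
    rows = []
    for page in analysis_results.get("pages", []):
        for line in page.get("lines", []):
            confidences = [w["confidence"] for w in line.get("words", [])]
            rows.append({
                "page_number": page["page_number"],
                "line_idx": line["line_idx"],
                "text": line["text"],
                # Lowest word confidence flags lines that may need review.
                "min_confidence": min(confidences) if confidences else None,
            })
    return rows


def paragraphs_to_text(analysis_results):
    """Join paragraph contents into one text block, separated by blank lines."""
    return "\n\n".join(p["content"] for p in analysis_results.get("paragraphs", []))


# Made-up sample in the Step 2 output shape (illustrative values only).
sample = {
    "pages": [{
        "page_number": 1,
        "lines": [{
            "line_idx": 0,
            "text": "Loan Agreement",
            "words": [
                {"content": "Loan", "confidence": 0.99},
                {"content": "Agreement", "confidence": 0.95},
            ],
        }],
    }],
    "paragraphs": [{"content": "Loan Agreement"}, {"content": "Borrower: Jane Smith"}],
}

rows = lines_to_rows(sample)
text = paragraphs_to_text(sample)
```

The rows can then be loaded into a table tool of choice, while `text` holds the free-form content.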