# Document Processing with Azure

This notebook demonstrates how to upload data to Azure Blob Storage, process the data using Azure Document Intelligence, and retrieve the information from these documents in JSON format. The steps include:

1. **Upload Data to Blob Storage**: We will upload documents to Azure Blob Storage for processing.
2. **Process Data with Azure Document Intelligence**: Utilize Azure's Document Intelligence capabilities to analyze and extract information from the uploaded documents.
3. **Retrieve Information in JSON Format**: Extract and retrieve the processed information in JSON format for further use.
4. **Structure the Retrieved Data**: Organize the extracted information into tables and text.

## Importance of Document Processing

Automating document processing improves efficiency and accuracy when handling large volumes of documents. With Azure's cloud services, organizations can streamline their workflows, reduce manual data-entry errors, and make the information in their documents easier to access and use in decision-making.

### Step 1 - Upload Data to Blob Storage

In this step, we will upload our documents to Azure Blob Storage, which serves as a scalable and secure storage solution for our data. The data used is stored in this repo's *data* folder.

In [None]:
from azure.core.exceptions import ResourceExistsError
from azure.storage.blob import BlobServiceClient
import os

# Create a BlobServiceClient from the connection string
blob_service_client = BlobServiceClient.from_connection_string(os.getenv('connection_string'))

# Create the container if it doesn't exist
container_client = blob_service_client.get_container_client(os.getenv('container_name'))
try:
    container_client.create_container()
except ResourceExistsError:
    # Only an existing container is expected here; other errors should surface.
    print("Container already exists.")

# Upload files in the data folder to the blob container
for filename in os.listdir(os.getenv('data_folder')):
    file_path = os.path.join(os.getenv('data_folder'), filename)
    if os.path.isfile(file_path):
        blob_client = blob_service_client.get_blob_client(container=os.getenv('container_name'), blob=filename)
        with open(file_path, "rb") as data:
            blob_client.upload_blob(data, overwrite=True)
        print(f"Uploaded {filename} to blob storage.")
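
The cell above reads `connection_string`, `container_name`, and `data_folder` with `os.getenv`, which returns `None` when a variable is unset and tends to produce a less obvious error further down. A minimal sketch of a fail-fast check follows; the helper name `require_env` is hypothetical and not part of any Azure SDK.

```python
import os


def require_env(*names):
    """Return the values of the named environment variables, in order.

    Raises one KeyError that lists every variable that is missing or empty.
    `require_env` is a hypothetical helper written for this notebook.
    """
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return [os.environ[name] for name in names]
```

It could be called once at the top of the notebook, for example `connection_string, container_name, data_folder = require_env("connection_string", "container_name", "data_folder")`, so that a missing setting is reported before any Azure call is made.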

### Step 2 - Process Data with Azure Document Intelligence

In this step, we will use Azure Document Intelligence to analyze and extract information from the uploaded documents. The code demonstrates how to authenticate with Azure, submit a document for analysis, and retrieve the results, including detected languages, lines, words, and paragraphs.
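
Lines and words are linked through spans: each element carries an `offset` and a `length` into the document's full text, and a word belongs to a line when its span lies entirely inside one of the line's spans. The sketch below illustrates that containment test on hand-written values. The `Span` dataclass is a stand-in for the SDK's span objects, assumed here to expose only `offset` and `length`, and `span_contains` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Stand-in for an SDK span: a start offset and a length in characters."""
    offset: int
    length: int


def span_contains(outer: Span, inner: Span) -> bool:
    """Return True when `inner` lies entirely within `outer`."""
    inner_end = inner.offset + inner.length
    outer_end = outer.offset + outer.length
    return inner.offset >= outer.offset and inner_end <= outer_end


# For the text "Hello world\nBye", the first line covers offsets 0 to 10.
line_one = Span(offset=0, length=11)
words = {"Hello": Span(0, 5), "world": Span(6, 5), "Bye": Span(12, 3)}

words_in_line_one = [text for text, span in words.items() if span_contains(line_one, span)]
print(words_in_line_one)  # ['Hello', 'world']
```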

In [None]:
def get_words(page, line):
    result = []
    for word in page.words:
        if _in_span(word, line.spans):
            result.append(word)
    return result

# To learn more about the concept of "span" used in the following code, visit: https://aka.ms/spans
def _in_span(word, spans):
    for span in spans:
        if word.span.offset >= span.offset and (word.span.offset + word.span.length) <= (span.offset + span.length):
            return True
    return False


def analyze_read():
    from azure.core.credentials import AzureKeyCredential
    from azure.ai.documentintelligence import DocumentIntelligenceClient
    from azure.ai.documentintelligence.models import DocumentAnalysisFeature, AnalyzeResult, AnalyzeDocumentRequest

    # Set DOCUMENTINTELLIGENCE_ENDPOINT and DOCUMENTINTELLIGENCE_API_KEY to your resource's endpoint and key (for example in a .env file).
    endpoint = os.environ["DOCUMENTINTELLIGENCE_ENDPOINT"]
    key = os.environ["DOCUMENTINTELLIGENCE_API_KEY"]

    document_intelligence_client = DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))

    # Analyze a document at a URL. Replace formUrl with the URL of your own document.
    # For more public sample URLs, visit: https://aka.ms/more-URLs
    # To analyze a document in Blob Storage, generate a SAS URL first: https://aka.ms/create-sas-tokens
    formUrl = "https://raw.githubusercontent.com/Azure-Samples/cognitive-services-REST-api-samples/master/curl/form-recognizer/rest-api/read.png"
    poller = document_intelligence_client.begin_analyze_document(
        "prebuilt-read",
        AnalyzeDocumentRequest(url_source=formUrl),
        features=[DocumentAnalysisFeature.LANGUAGES]
    )       
    
    # To analyze a local document instead, uncomment the block below,
    # delete or comment out the "Analyze a document at a URL" call above,
    # and replace <path to your sample file> with your actual file path.
    # path_to_sample_document = "<path to your sample file>"
    # with open(path_to_sample_document, "rb") as f:
    #     poller = document_intelligence_client.begin_analyze_document(
    #         "prebuilt-read",
    #         analyze_request=f,
    #         features=[DocumentAnalysisFeature.LANGUAGES],
    #         content_type="application/octet-stream",
    #     )
    result: AnalyzeResult = poller.result()
    
    # [START analyze_read]
    # Detect languages.
    print("----Languages detected in the document----")
    if result.languages is not None:
        for language in result.languages:
            print(f"Language code: '{language.locale}' with confidence {language.confidence}")
    
    # To learn the detailed concept of "bounding polygon" in the following content, visit: https://aka.ms/bounding-region
    # Analyze pages.
    for page in result.pages:
        print(f"----Analyzing document from page #{page.page_number}----")
        print(f"Page has width: {page.width} and height: {page.height}, measured with unit: {page.unit}")

        # Analyze lines.
        if page.lines:
            for line_idx, line in enumerate(page.lines):
                words = get_words(page, line)
                print(
                    f"...Line # {line_idx} has {len(words)} words and text '{line.content}' within bounding polygon '{line.polygon}'"
                )

                # Analyze words.
                for word in words:
                    print(f"......Word '{word.content}' has a confidence of {word.confidence}")
        
    # Analyze paragraphs.
    if result.paragraphs:
        print(f"----Detected #{len(result.paragraphs)} paragraphs in the document----")
        for paragraph in result.paragraphs:
            print(f"Found paragraph within {paragraph.bounding_regions} bounding region")
            print(f"...with content: '{paragraph.content}'")

    print("----------------------------------------")
    # [END analyze_read]

if __name__ == "__main__":
    from azure.core.exceptions import HttpResponseError
    from dotenv import find_dotenv, load_dotenv

    try:
        load_dotenv(find_dotenv())
        analyze_read()
    except HttpResponseError as error:
        # Examples of how to check an HttpResponseError
        # Check by error code:
        if error.error is not None:
            if error.error.code == "InvalidImage":
                print(f"Received an invalid image error: {error.error}")
            if error.error.code == "InvalidRequest":
                print(f"Received an invalid request error: {error.error}")
            # Raise the error again after printing it
            raise
        # If the inner error is None, check the message for more information:
        if "Invalid request".casefold() in error.message.casefold():
            print(f"Uh-oh! Seems there was an invalid request: {error}")
        # Raise the error again
        raise


### Step 3 - Retrieve Information in JSON Format

In this step, we will extract and retrieve the processed information from the documents in JSON format. This will allow us to further analyze and utilize the extracted data for various purposes.
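
Document Intelligence downloads each document from the URL it is given, so a blob in a private container needs a read-only SAS token appended to its URL (the Step 2 comments link to https://aka.ms/create-sas-tokens). The following is a sketch under stated assumptions: `append_sas_token` and `make_read_sas_url` are hypothetical helper names, the storage account is assumed to allow key-based access, and the one-hour expiry is an arbitrary choice. `generate_blob_sas` and `BlobSasPermissions` come from the `azure-storage-blob` package.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import quote


def append_sas_token(blob_url: str, sas_token: str) -> str:
    """Join a blob URL and a SAS token into a single URL."""
    separator = "&" if "?" in blob_url else "?"
    return f"{blob_url}{separator}{sas_token.lstrip('?')}"


def make_read_sas_url(account_name, account_key, container_name, blob_name, hours_valid=1):
    """Build a time-limited, read-only URL for one blob (hypothetical helper)."""
    # Imported here, as in analyze_read above, so append_sas_token stays usable on its own.
    from azure.storage.blob import BlobSasPermissions, generate_blob_sas

    sas_token = generate_blob_sas(
        account_name=account_name,
        container_name=container_name,
        blob_name=blob_name,
        account_key=account_key,
        permission=BlobSasPermissions(read=True),
        expiry=datetime.now(timezone.utc) + timedelta(hours=hours_valid),
    )
    # Percent-encode the blob name so spaces and similar characters survive in the URL.
    blob_url = f"https://{account_name}.blob.core.windows.net/{container_name}/{quote(blob_name)}"
    return append_sas_token(blob_url, sas_token)
```

A short expiry with read-only permission keeps the exposure small: the link only needs to stay valid long enough for the analysis request to fetch the file.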

In [None]:
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential
import json
import os

# DocumentAnalysisClient and begin_analyze_document_from_url belong to the
# azure-ai-formrecognizer package, not to azure.ai.documentintelligence.
document_analysis_client = DocumentAnalysisClient(
    endpoint=os.getenv("DOCUMENTINTELLIGENCE_ENDPOINT"),
    credential=AzureKeyCredential(os.getenv("DOCUMENTINTELLIGENCE_API_KEY")),
)

# Function to analyze a document from blob storage
def analyze_document(blob_url):
    poller = document_analysis_client.begin_analyze_document_from_url("prebuilt-document", blob_url)
    return poller.result()

# Retrieve and process documents from blob storage
container_client = blob_service_client.get_container_client(os.getenv("container_name"))

for blob in container_client.list_blobs():
    # The service must be able to read this URL. For a private container,
    # append a SAS token to it (see https://aka.ms/create-sas-tokens).
    blob_url = container_client.get_blob_client(blob.name).url
    result = analyze_document(blob_url)

    # Convert result to JSON format; default=str covers values json cannot encode (such as dates)
    result_json = result.to_dict()
    print(json.dumps(result_json, indent=2, default=str))
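
Printing the JSON is convenient for inspection, but later steps are easier when each result is also saved to disk. A minimal sketch, assuming the result has already been converted to a plain dictionary; the helper name `save_result_json` and the default `output` directory are hypothetical.

```python
import json
from pathlib import Path


def save_result_json(result_dict, blob_name, output_dir="output"):
    """Write one analysis result to <output_dir>/<blob file name>.json and return the path."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Keep only the final path component so blob names containing "/" stay inside out_dir.
    out_path = out_dir / f"{Path(blob_name).name}.json"
    with open(out_path, "w", encoding="utf-8") as f:
        # default=str converts values json cannot encode (such as dates) to strings.
        json.dump(result_dict, f, indent=2, ensure_ascii=False, default=str)
    return out_path
```

Inside the loop above, this could be called as `save_result_json(result_json, blob.name)`. Blobs in different virtual folders that share a file name would overwrite each other with this naming scheme, so a different scheme may be needed for nested containers.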

### Step 4 - Structure the Retrieved Data

In this step, we will structure the data retrieved from Azure Document Intelligence. The service returns its output as JSON, and it is up to us to process and organize it: some of the data will be structured into tables, while the rest will be kept as text. This ensures that the extracted information is organized in a meaningful way for further analysis and use.
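
One way to approach this is to rebuild each detected table as a grid of rows and to collect paragraphs as plain text. The sketch below works on a plain dictionary. The key names (`tables`, `row_count`, `column_count`, `cells`, `row_index`, `column_index`, `paragraphs`, `content`) are assumptions about the shape of the converted result and should be checked against the JSON printed in Step 3. The function names and the sample data are hypothetical.

```python
def table_to_rows(table):
    """Rebuild one table dictionary as a list of rows, each a list of cell strings."""
    rows = [["" for _ in range(table["column_count"])] for _ in range(table["row_count"])]
    for cell in table["cells"]:
        rows[cell["row_index"]][cell["column_index"]] = cell["content"]
    return rows


def structure_result(result_dict):
    """Split an analysis result into table grids and a list of paragraph texts."""
    tables = [table_to_rows(table) for table in result_dict.get("tables") or []]
    text = [paragraph["content"] for paragraph in result_dict.get("paragraphs") or []]
    return {"tables": tables, "text": text}


# Hand-written sample in the assumed shape, not real service output.
sample = {
    "paragraphs": [{"content": "Invoice 001"}],
    "tables": [
        {
            "row_count": 2,
            "column_count": 2,
            "cells": [
                {"row_index": 0, "column_index": 0, "content": "Item"},
                {"row_index": 0, "column_index": 1, "content": "Price"},
                {"row_index": 1, "column_index": 0, "content": "Pen"},
                {"row_index": 1, "column_index": 1, "content": "2.50"},
            ],
        }
    ],
}

structured = structure_result(sample)
print(structured["tables"][0])  # [['Item', 'Price'], ['Pen', '2.50']]
print(structured["text"])       # ['Invoice 001']
```

The row lists can be passed on to whatever tabular tool is used downstream. Cells that span several rows or columns are written only at their top-left position in this sketch.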