# Data Extraction - Comprehensive Azure AI Document Intelligence + Azure OpenAI GPT-4o with Vision

This sample demonstrates how to build a comprehensive process to extract structured data from any document using Azure AI Document Intelligence and Azure OpenAI's GPT-4o model with vision capabilities.

![Data Extraction](../../../images/extraction-comprehensive.png)

This is achieved by the following process:

- Analyze a document using Azure AI Document Intelligence's `prebuilt-layout` model to extract the structure as Markdown.
- Construct a system prompt that defines the instruction for extracting structured data from documents.
- Construct a user prompt that includes the specific extraction instruction for the type of document, the text content, and each document page as a base64 encoded image.
- Use the Azure OpenAI chat completions API with the GPT-4o model to generate a structured output from the content.

## Objectives

By the end of this sample, you will have learned how to:

- Convert a document to Markdown format using Azure AI Document Intelligence.
- Convert a document into a set of base64 encoded images for processing by GPT-4o.
- Use prompt engineering techniques to instruct GPT-4o to extract structured data from a type of document.
- Use the [Structured Outputs feature](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/structured-outputs?tabs=python-secure) to extract structured data from the document page images using Azure OpenAI's GPT-4o model.
- Use the analysis result from Azure AI Document Intelligence to determine the confidence of the extracted structured output.
- Use the [logprobs](https://learn.microsoft.com/en-us/azure/ai-services/openai/reference#request-body:~:text=False-,logprobs,-integer) parameter in an OpenAI request to determine the confidence of the extracted structured output.
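
As a rough illustration of the Document Intelligence confidence objective, the sketch below finds an extracted value among OCR words and reports the lowest word confidence in the match. Each word in a Document Intelligence analysis result has `content` and `confidence`; here they are represented by hand-written `(content, confidence)` tuples with made-up scores. The `value_confidence` helper is hypothetical and is not the logic of `modules.document_intelligence_confidence`.

```python
from typing import Optional, Sequence, Tuple

Word = Tuple[str, float]  # (content, confidence), illustrative stand-in for OCR words


def value_confidence(value: str, words: Sequence[Word]) -> Optional[float]:
    """Return the lowest confidence among consecutive words spelling `value`, or None."""
    tokens = value.split()
    if not tokens:
        return None
    size = len(tokens)
    # Slide a window over the OCR words and look for an exact token match.
    for start in range(len(words) - size + 1):
        window = words[start:start + size]
        if [content for content, _ in window] == tokens:
            return min(confidence for _, confidence in window)
    return None


# Made-up confidence scores for illustration only.
sample_words = [("Account", 0.99), ("number:", 0.98), ("9K", 0.91), ("98981212", 0.87)]
print(value_confidence("9K 98981212", sample_words))
```

Taking the minimum is a conservative choice: a multi-word value is only as trustworthy as its least certain word.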

## Setup

### Import modules

This sample takes advantage of the following Python dependencies:

- **pdf2image** for converting a PDF file into a set of images per page.
- **azure-ai-documentintelligence** to interface with the Azure AI Document Intelligence API for analyzing documents.
- **openai** to interface with the Azure OpenAI chat completions API to generate structured extraction outputs using the GPT-4o model.
- **azure-identity** to securely authenticate with deployed Azure Services using Microsoft Entra ID credentials.

The following local modules are also used:

- **modules.app_settings** to access environment variables from the `.env` file.
- **modules.comparison** to compare the output of the extraction process with expected results.
- **modules.document_intelligence_confidence** to evaluate the confidence of the extraction process based on the extracted structured output and the analysis result from Azure AI Document Intelligence.
- **modules.document_processing_result** to store the results of the extraction process as a file.
- **modules.openai_confidence** to calculate the confidence of the extraction process based on the `logprobs` response from the API request.
- **modules.vehicle_insurance_policy** to provide the expected structured output JSON schema for vehicle insurance policy documents.
- **modules.utils** `Stopwatch` to measure the end-to-end execution time for the extraction process.
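
The implementation of `modules.utils.Stopwatch` is not shown in this sample. A minimal sketch of a context-manager timer that could be used the same way (`with Stopwatch() as sw:`) might look like the following; the `elapsed` attribute name is an assumption.

```python
import time


class Stopwatch:
    """Minimal context-manager timer (hypothetical stand-in for modules.utils.Stopwatch)."""

    def __init__(self) -> None:
        self.elapsed = 0.0  # seconds; attribute name is an assumption
        self._start = 0.0

    def __enter__(self) -> "Stopwatch":
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> bool:
        self.elapsed = time.perf_counter() - self._start
        return False  # never swallow exceptions raised inside the block


with Stopwatch() as sw:
    sum(range(100_000))  # placeholder for the work being timed
print(f"elapsed: {sw.elapsed:.6f}s")
```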

In [50]:
import sys
sys.path.append('../../') # Import local modules

from IPython.display import display, Markdown
import os
import pandas as pd
from dotenv import dotenv_values
from pdf2image import convert_from_bytes
import base64
import io
from openai import AzureOpenAI
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeResult, ContentFormat
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
import json

from modules.app_settings import AppSettings
from modules.utils import Stopwatch
from modules.accuracy_evaluator import AccuracyEvaluator
from modules.comparison import get_extraction_comparison
from modules.confidence import merge_confidence_values
from modules.document_intelligence_confidence import evaluate_confidence as di_evaluate_confidence
from modules.openai_confidence import evaluate_confidence as oai_evaluate_confidence
from modules.vehicle_insurance_policy import VehicleInsurancePolicy
from modules.document_processing_result import DataExtractionResult

from modules.taxdoc import TaxDocument

### Configure the Azure services

The Azure AI Document Intelligence and Azure OpenAI SDKs are used to create client instances from a deployed endpoint and authentication credentials.

For this sample, Azure CLI credentials are used to authenticate with the deployed services.

In [51]:
# Set the working directory to the root of the repo
working_dir = os.path.abspath('../../../')
settings = AppSettings(dotenv_values(f"{working_dir}/.env"))

# Configure the default credential for accessing Azure services using Azure CLI credentials
credential = DefaultAzureCredential(
    exclude_workload_identity_credential=True,
    exclude_developer_cli_credential=True,
    exclude_environment_credential=True,
    exclude_managed_identity_credential=True,
    exclude_powershell_credential=True,
    exclude_shared_token_cache_credential=True,
    exclude_interactive_browser_credential=True
)

openai_token_provider = get_bearer_token_provider(credential, 'https://cognitiveservices.azure.com/.default')

openai_client = AzureOpenAI(
    azure_endpoint=settings.openai_endpoint,
    azure_ad_token_provider=openai_token_provider,
    api_version="2024-10-01-preview" # Requires the latest API version for structured outputs.
)

document_intelligence_client = DocumentIntelligenceClient(
    endpoint=settings.ai_services_endpoint,
    credential=credential
)
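
The `AppSettings` class is not shown in this sample. The sketch below illustrates one way a settings object could fail fast when a value is missing from the `.env` mapping. The attribute names match those used in the cell above; the class name `SketchSettings` and the environment key names are assumptions.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SketchSettings:
    """Hypothetical settings holder; not the sample's AppSettings."""

    openai_endpoint: str
    ai_services_endpoint: str
    gpt4o_model_deployment_name: str

    @classmethod
    def from_env(cls, env: Mapping[str, Optional[str]]) -> "SketchSettings":
        def require(key: str) -> str:
            value = env.get(key)
            if not value:
                raise KeyError(f"Missing required setting: {key}")
            return value

        # The key names below are assumed for illustration.
        return cls(
            openai_endpoint=require("OPENAI_ENDPOINT"),
            ai_services_endpoint=require("AI_SERVICES_ENDPOINT"),
            gpt4o_model_deployment_name=require("GPT4O_MODEL_DEPLOYMENT_NAME"),
        )


example_env = {
    "OPENAI_ENDPOINT": "https://example-openai.invalid/",
    "AI_SERVICES_ENDPOINT": "https://example-ai.invalid/",
    "GPT4O_MODEL_DEPLOYMENT_NAME": "gpt-4o",
}
sketch_settings = SketchSettings.from_env(example_env)
print(sketch_settings.gpt4o_model_deployment_name)
```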

In [52]:
path = f"{working_dir}/samples/assets/taxforms/"
pdf_files = [f for f in os.listdir(path) if f.endswith('.pdf')]
# metadata_fname = "taxform12.json" # Change this to the file you want to evaluate
# metadata_fpath = f"{path}{metadata_fname}"

# with open(metadata_fpath, 'r') as f:
#     data = json.load(f)
    
# expected = TaxDocument(**data['expected'])
# pdf_fname = data['fname']
# pdf_fpath = f"{path}{pdf_fname}"

# tax_evaluator = AccuracyEvaluator(match_keys=[])

## Extract data from the document

The following code blocks run the data extraction process using Azure OpenAI's GPT-4o model with vision capabilities.

They perform the following steps:

1. Get the document bytes from the provided file path. _Note: This example processes a local document, but you can use any document storage location of your choice, such as Azure Blob Storage._
2. Use Azure AI Document Intelligence to analyze the structure of the document and convert it to Markdown format using the `prebuilt-layout` model.
3. Use pdf2image to convert each page of the document into an image, encoded as a base64 string.
4. Using Azure OpenAI's GPT-4o model, extract a structured data transfer object (DTO) from the Markdown text and the page images. _Note: The request in this notebook uses JSON mode (`response_format={ "type": "json_object" }`), which only guarantees valid JSON. The [Structured Outputs feature](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/structured-outputs?tabs=python-secure) additionally constrains the response to a supplied JSON schema._
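
To constrain the response to a schema with Structured Outputs, the chat completions request takes a `response_format` of type `json_schema`. The sketch below builds such a value as a plain dictionary. The field names are hypothetical, since the `TaxDocument` model is not shown here. In strict mode every property must be listed in `required` and `additionalProperties` must be `False`, so optional values are expressed as a union with `null`.

```python
import json

# Hypothetical fields for illustration; the real TaxDocument model is not shown.
tax_document_schema = {
    "type": "object",
    "properties": {
        "account_number": {"type": ["string", "null"]},
        "statement_date": {
            "type": ["string", "null"],
            "description": "Date in the format YYYY-MM-DD",
        },
        "closing_account_value": {"type": ["number", "null"]},
    },
    # Strict mode: list every property as required and forbid extras.
    "required": ["account_number", "statement_date", "closing_account_value"],
    "additionalProperties": False,
}

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "tax_document",  # assumed schema name
        "strict": True,
        "schema": tax_document_schema,
    },
}

print(json.dumps(response_format, indent=2))
```

This dictionary could be passed as the `response_format` argument in place of `{ "type": "json_object" }`.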

In [53]:
with Stopwatch() as di_stopwatch:
    pdf_file_path = os.path.join(path, pdf_files[0])
    with open(pdf_file_path, "rb") as f:
        poller = document_intelligence_client.begin_analyze_document(
            "prebuilt-layout",
            analyze_request=f,
            output_content_format=ContentFormat.MARKDOWN,
            content_type="application/pdf"
        )
    
    result: AnalyzeResult = poller.result()

markdown = result.content

# Write the markdown output to a file
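# os.path.splitext keeps file names that contain extra dots intact
markdown_filename = f"{os.path.splitext(pdf_files[0])[0]}_md"
# Write as UTF-8 so non-ASCII characters in the document do not fail on platforms with a different default encoding
with open(os.path.join(path, f"{markdown_filename}.md"), "w", encoding="utf-8") as md_file:
    md_file.write(markdown)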
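
The layout model's Markdown output separates pages with `<!-- PageBreak -->` comments, as seen in the output displayed later in this notebook. A minimal sketch of splitting the Markdown into per-page chunks, for example to pair each chunk with its page image, is shown below. The `split_pages` helper is hypothetical and not part of the sample's modules.

```python
import re
from typing import List

# Matches the page separator together with any surrounding whitespace.
PAGE_BREAK = re.compile(r"\s*<!-- PageBreak -->\s*")


def split_pages(markdown_text: str) -> List[str]:
    """Split layout Markdown into per-page chunks, dropping empty pages."""
    pages = [page.strip() for page in PAGE_BREAK.split(markdown_text)]
    return [page for page in pages if page]


sample = "# Page one\n\n<!-- PageBreak -->\n\n# Page two\n"
print(split_pages(sample))
```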

In [54]:
# Prepare the user content for the OpenAI API including any specific details for processing this type of document, text, and the document page images.
user_content = []
user_content.append({
    "type": "text",
    "text": f"""Extract the data from this tax document. 
    - If a value is not present, provide null.
    - Dates should be in the format YYYY-MM-DD."""
})

user_content.append({
    "type": "text",
    "text": markdown
})

In [55]:
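with Stopwatch() as image_stopwatch:
    # Note: this appends page images from every PDF in the folder,
    # while the Markdown above comes from pdf_files[0] only.
    for pdf_file in pdf_files:
        pdf_file_path = os.path.join(path, pdf_file)
        # Read the PDF inside a context manager so the file handle is closed
        with open(pdf_file_path, "rb") as pdf:
            document_bytes = pdf.read()
        page_images = convert_from_bytes(document_bytes)
        pdf_stem = os.path.splitext(pdf_file)[0]
        # enumerate yields the page number directly instead of searching the list
        for page_number, page_image in enumerate(page_images, start=1):
            byteIO = io.BytesIO()
            page_image.save(byteIO, format='PNG')
            base64_data = base64.b64encode(byteIO.getvalue()).decode('utf-8')

            user_content.append({
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/png;base64,{base64_data}"
                }
            })
            # Include the PDF name so pages from different PDFs do not overwrite each other
            page_image.save(os.path.join(path, f"{pdf_stem}_page_{page_number}.png"))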
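
The image encoding step can be isolated into a small helper. The sketch below uses only the standard library to turn raw image bytes into the `data:` URL form used in the `image_url` content part. The helper names `to_data_url` and `image_part` are hypothetical.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_part(image_bytes: bytes) -> dict:
    """Build an image_url content part for a chat completions user message."""
    return {"type": "image_url", "image_url": {"url": to_data_url(image_bytes)}}


# The 8-byte PNG signature stands in for a real page image here.
png_signature = b"\x89PNG\r\n\x1a\n"
part = image_part(png_signature)
print(part["image_url"]["url"])
```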

In [56]:
with Stopwatch() as oai_stopwatch:
    completion = openai_client.chat.completions.create(
        model=settings.gpt4o_model_deployment_name,
        messages=[
            {
                "role": "system",
                "content": "You are an AI assistant that extracts data from documents and returns the result in JSON format. Please write a clean json file without any special characters",
            },
            {
                "role": "user",
                "content": user_content
            }
        ],
        response_format={ "type": "json_object" },
        max_tokens=4096,
        temperature=0.1,
        top_p=0.1,
        logprobs=True # Enabled to determine the confidence of the response.
    )
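
With `logprobs=True`, each generated token in the response carries a log probability (available as `completion.choices[0].logprobs.content`, where every item has `token` and `logprob`). One simple way to turn these into a confidence score is to convert each log probability to a probability and average them. The sketch below does this on hand-written values. It is an assumption about how such a score could be computed, not the logic of `modules.openai_confidence`.

```python
import math
from typing import List, Sequence


def token_probabilities(logprobs: Sequence[float]) -> List[float]:
    """Convert log probabilities to linear probabilities in [0, 1]."""
    return [math.exp(logprob) for logprob in logprobs]


def mean_confidence(logprobs: Sequence[float]) -> float:
    """Average token probability; 0.0 when there are no tokens."""
    probabilities = token_probabilities(logprobs)
    return sum(probabilities) / len(probabilities) if probabilities else 0.0


# Illustrative values: token probabilities of 1.0, 0.5, and 0.25.
sample_logprobs = [0.0, math.log(0.5), math.log(0.25)]
print(round(mean_confidence(sample_logprobs), 4))
```

With a real response, the input could be built as `[t.logprob for t in completion.choices[0].logprobs.content]`.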

In [57]:
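# Write the model output to a JSON file
output_filename = f"{os.path.splitext(pdf_files[0])[0]}_output.json"
output_filepath = os.path.join(path, output_filename)

# The message content is a JSON string; parse it first so the file contains
# a JSON object rather than a quoted, escaped string.
extracted_data = json.loads(completion.choices[0].message.content)

with open(output_filepath, "w", encoding="utf-8") as json_file:
    json.dump(extracted_data, json_file, indent=4)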
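
JSON mode returns the result as a string, and the JSON can be incomplete when generation stops at the `max_tokens` limit (reported as a `finish_reason` of `length`). The sketch below shows one defensive way to parse the content before writing it out. The `parse_model_json` helper is hypothetical, and the sample payload is hand-written.

```python
import json
from typing import Optional


def parse_model_json(content: Optional[str], finish_reason: str) -> dict:
    """Parse a JSON-mode response, rejecting truncated or empty output."""
    if finish_reason == "length":
        raise ValueError("Response hit the token limit; the JSON may be incomplete.")
    if not content:
        raise ValueError("The model returned no content.")
    data = json.loads(content)
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level.")
    return data


# Hand-written payload; null marks a value that is not present in the document.
sample_content = '{"account_number": "9K 98981212", "statement_date": null}'
parsed = parse_model_json(sample_content, "stop")
print(parsed["statement_date"] is None)
```

With a real response, the arguments would be `completion.choices[0].message.content` and `completion.choices[0].finish_reason`.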

In [58]:
# Displays the output of the Azure AI Document Intelligence pre-built layout analysis in Markdown format.
display(Markdown(markdown))

a

Test Financial Services Inc.
Private Wealth Management 999
Test Location
Apt. 900 - Test BUILDING
Maples IN yyyy00000
CHD0000000000 999999 9

Test Test Program
June 2023

Test BANK & Test N.A.
Test Street ABC Eve.
3111 BLVD SK CA 00000-1989

Account name: Test BANK & TRUST N.A.

Test Name

Account number: 9K 98981212

Your Financial Advisor:

THE Test GROUP

Phone: 91919191/12121212

Questions about your statement? Call
your Financial Advisor or the RMA
ResourceLine at 800-RMA-9999,
account 44XXXX999.

Visit our website:
www.Test.com/financialservices

Items for your attention

4 Help protect yourself from fraud and
review bank, credit card and brokerage
statements regularly. Also, get your free
credit report annually from
www.annualcreditreport.com.

Value of your account
on May 31 ($)

on June 30 ($)


<table>
<tr>
<th>Your assets</th>
<th>2,954,398.87</th>
<th>3,080,882.85</th>
</tr>
<tr>
<td>Your liabilities</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>Value of your account</td>
<td>$2,954,398.87</td>
<td>$3,080,882.85</td>
</tr>
<tr>
<td>Accrued interest in value above</td>
<td>$8,806.88</td>
<td>$6,809.65</td>
</tr>
</table>


As a service to you, your portfolio value of
$3,080,882.85 includes accrued interest.


<table>
<tr>
<th>Dec 2021</th>
<th>Dec 2022</th>
</tr>
<tr>
<td></td>
<td>May 2023</td>
</tr>
<tr>
<td></td>
<td>Jun 2023</td>
</tr>
</table>


# Tracking the value of your account

$ Thousands

3,577.3

2,852.8

2,954.4

3,080.9

Member SIPC

CNP4000000001010104050803070 666666666 00000004 99999999065 2T02798NA0 111100


<figure>

PDF Merger

2

<!-- PageNumber="Page 1 of 54" -->

</figure>


<!-- PageBreak -->


# Sources of your account growth during 2023


<table>
<tr>
<td>Value of your account</td>
<td></td>
</tr>
<tr>
<td>at year end 2022</td>
<td>$2,852,834.99</td>
</tr>
<tr>
<td>Net deposits and</td>
<td></td>
</tr>
<tr>
<td>withdrawals</td>
<td>-$28,316.30</td>
</tr>
<tr>
<td>Your investment return:</td>
<td></td>
</tr>
<tr>
<td>Dividend and</td>
<td></td>
</tr>
<tr>
<td>interest income</td>
<td>$30,053.15</td>
</tr>
<tr>
<td rowspan="2">Change in value of accrued interest</td>
<td></td>
</tr>
<tr>
<td>-$684.16</td>
</tr>
<tr>
<td rowspan="2">Change in market value</td>
<td></td>
</tr>
<tr>
<td>$226,995.17</td>
</tr>
</table>


Value of your account

<!-- PageFooter="CNP700030001015555550803070 0N0P070000000403050100070 99990004 0623" -->


<figure>

PDF Merger

<!-- PageNumber="Page 2 of 54" -->

</figure>


<!-- PageBreak -->


# Your account balance sheet

The value of your account includes assets held at ABC and certain assets held away from ABC. See page 1 for
more information.


## Summary of your assets


<table>
<tr>
<th></th>
<th>Value on June 30 ($)</th>
<th>Percentage of your account</th>
</tr>
<tr>
<td>A Cash and money balances</td>
<td>96,118.72</td>
<td>3.12%</td>
</tr>
<tr>
<td>B Cash alternatives</td>
<td>0.00</td>
<td>0.00%</td>
</tr>
<tr>
<td>C Equities</td>
<td>2,442,159.48</td>
<td>79.27%</td>
</tr>
<tr>
<td>D Fixed income</td>
<td>542,604.65</td>
<td>17.61%</td>
</tr>
<tr>
<td>E Non-traditional</td>
<td>0.00</td>
<td>0.00%</td>
</tr>
<tr>
<td>F Commodities</td>
<td>0.00</td>
<td>0.00%</td>
</tr>
<tr>
<td>G Other</td>
<td>0.00</td>
<td>0.00%</td>
</tr>
<tr>
<td>Total assets</td>
<td>$3,080,882.85</td>
<td>100.00%</td>
</tr>
</table>


Value of your account

$3,080,882.85


<figure>
<figcaption>Your current asset Test</figcaption>

A

D

C

</figure>


4 Cash and money balances may include
available cash balances, ABC Bank USA deposit
account, ABC FDIC Insured Deposit Program Bank
accounts, ABC Insured Sweep Program Bank
accounts, ABC AG EFG Branch deposit account
balances and money market mutual fund sweep
balances. See the Important information about your
statement on the last two pages of this statement
for details about those balances.


## Eye on the markets


<table>
<tr>
<th rowspan="2">Index</th>
<th colspan="2">Percentage change</th>
</tr>
<tr>
<th>June 2023</th>
<th>Year to date</th>
</tr>
<tr>
<td>S&amp;P 500</td>
<td>6.61%</td>
<td>16.89%</td>
</tr>
<tr>
<td>Russell 3000</td>
<td>6.83%</td>
<td>16.17%</td>
</tr>
<tr>
<td>MSCI - Europe, Australia &amp; Far East</td>
<td>4.58%</td>
<td>12.13%</td>
</tr>
<tr>
<td>Barclays Capital U.S. Aggregate Bond Index</td>
<td>-0.36%</td>
<td>2.09%</td>
</tr>
</table>


Interest rates on June 30, 2023

3-month Treasury bills: 5.13%
One-month LIBOR: 5.22%

<!-- PageFooter="Member SIPC" -->
<!-- PageFooter="CNP4000000001010104050803070 666666666 00000004 99999999065 2T02798NA0 111100" -->


<figure>

PDF Merger

2

<!-- PageNumber="Page 1 of 54" -->

</figure>


<!-- PageBreak -->

a

Test Test Program
June 2023


<table>
<tr>
<td>Account name:</td>
<td>XYX BANK &amp; TRUST N.A. 2T 9999</td>
</tr>
<tr>
<td>Account number:</td>
<td>NA</td>
</tr>
</table>


Your Financial Advisor:

THE Test GROUP 239-

000-1100/800-XXX-9999

<!-- PageFooter="CNP700030001015555550803070 0N0P070000000403050100070 99990004 0623" -->


<figure>

PDF Merger

<!-- PageNumber="Page 2 of 54" -->

</figure>


<!-- PageBreak -->

a

<!-- PageHeader="Test Test Program June 2023" -->


<table>
<tr>
<td>Account name:</td>
<td>TEST BANK &amp; TRUST N.A.</td>
</tr>
<tr>
<td>Account number:</td>
<td>0T 00001 NA</td>
</tr>
</table>


<!-- PageHeader="Your Financial Advisor: THE DUMMY GROUP 239-000-1100/800-XXX-8682" -->


## Change in the value of your account


<table>
<tr>
<th></th>
<th>June 2023 ($)</th>
<th>Year to date ($)</th>
</tr>
<tr>
<td>Opening account value</td>
<td>$2,954,398.87</td>
<td>$2,852,834.99</td>
</tr>
<tr>
<td>Withdrawals and fees, including investments transferred out</td>
<td>-15,817.55</td>
<td>-28,316.30</td>
</tr>
<tr>
<td>Dividend and interest income</td>
<td>8,818.14</td>
<td>30,053.15</td>
</tr>
<tr>
<td>Change in value of accrued interest</td>
<td>-1,997.23</td>
<td>-684.16</td>
</tr>
<tr>
<td>Change in market value</td>
<td>135,480.62</td>
<td>226,995.17</td>
</tr>
<tr>
<td>Closing account value</td>
<td>$3,080,882.85</td>
<td>$3,080,882.85</td>
</tr>
</table>


# Dividend and interest income earned

For purposes of this statement, taxability of interest and dividend income has been determined from a US tax
reporting perspective. Based upon the residence of the account holder, account type, or product type, some
interest and/or dividend payments may not be subject to United States (US) and/or Puerto Rico (PR) income taxes.
The client monthly statement is not intended to be used and cannot be relied upon for tax purposes. Clients should
refer to the applicable tax reporting forms they receive from ABC annually, such as the Forms 1099 and the Forms
480, for tax reporting information. It is the practice of ABC to file the applicable tax reporting forms with the US
Internal Revenue Service and PR Treasury Department, and in such forms accurately classify dividends and/or
interest as tax exempt or taxable income. Please consult your individual tax preparer.


<table>
<tr>
<th></th>
<th>June 2023 ($)</th>
<th>Year to date ($)</th>
</tr>
<tr>
<td>Taxable dividends</td>
<td>4,706.74</td>
<td>15,977.42</td>
</tr>
<tr>
<td>Taxable interest</td>
<td>55.84</td>
<td>333.78</td>
</tr>
<tr>
<td>Tax-exempt interest</td>
<td>4,055.56</td>
<td>12,615.56</td>
</tr>
<tr>
<td>Tax-exempt accrued interest paid</td>
<td>0.00</td>
<td>-172.22</td>
</tr>
<tr>
<td>Tax-exempt accrued interest received</td>
<td>0.00</td>
<td>1,298.61</td>
</tr>
<tr>
<td>Total current year</td>
<td>$8,818.14</td>
<td>$30,053.15</td>
</tr>
<tr>
<td>Total dividend &amp; interest</td>
<td>$8,818.14</td>
<td>$30,053.15</td>
</tr>
</table>


# Summary of gains and losses

Values reported below exclude products for which gains and losses are not classified.


<table>
<tr>
<th rowspan="2"></th>
<th colspan="2">Realized gains and losses</th>
<th rowspan="2">Unrealized gains and losses ($)</th>
</tr>
<tr>
<th>June 2023 ($)</th>
<th>Year to date ($)</th>
</tr>
<tr>
<td>Short term</td>
<td>-210.87</td>
<td>-9,269.02</td>
<td>192,668.84</td>
</tr>
<tr>
<td>Long term</td>
<td>-2,881.55</td>
<td>93,361.97</td>
<td>248,735.17</td>
</tr>
<tr>
<td>Total</td>
<td>-$3,092.42</td>
<td>$84,092.95</td>
<td>$441,404.01</td>
</tr>
</table>


<figure>

PDF Merger

2

<!-- PageNumber="Page 3 of 54" -->

</figure>


<!-- PageFooter="CNP70000011111111 XX000011111 00004 0623 010101010 2X020288888100" -->
<!-- PageBreak -->

a

Test Test Program
June 2023

Account name:

XYX BANK & TRUST N.A. 2T 9999

Account number:

NA

Your Financial Advisor:

THE Test GROUP 239-

000-1100/800-XXX-9999

<!-- PageFooter="CNP70000011100000 NA7000115107 00004 0623 000000111 XY02711NA0 111100" -->


<figure>

PDF Merger

<!-- PageNumber="Page 4 of 54" -->

</figure>