# Data Extraction - Azure AI Document Intelligence + Azure OpenAI GPT-4o

This sample demonstrates how to extract structured data from any document using Azure AI Document Intelligence and Azure OpenAI GPT models.

![Data Extraction](../../../images/extraction-document-intelligence-openai.png)

This is achieved by the following process:

- Analyze a document using Azure AI Document Intelligence's `prebuilt-layout` model to extract the structure as Markdown.
- Construct a system prompt that defines the instructions for extracting structured data from documents.
- Construct a user prompt that includes the specific extraction instructions for the type of document, and the Markdown content of the document.
- Use the Azure OpenAI chat completions API with the GPT-4o model to generate a structured output from the content.
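
The prompt construction steps above can be sketched as a small helper. This is a minimal illustration, not the sample's own code: `build_messages` and `SYSTEM_PROMPT` are hypothetical names, and the Markdown string stands in for the output of the `prebuilt-layout` analysis.

```python
from typing import Dict, List

# Assumed system prompt; it mirrors the wording used in the request later in this sample.
SYSTEM_PROMPT = "You are an AI assistant that extracts data from documents."


def build_messages(instruction: str, markdown: str) -> List[Dict[str, str]]:
    """Return chat messages: system prompt, extraction instruction, document Markdown."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": instruction},
        {"role": "user", "content": markdown},
    ]


messages = build_messages(
    "Extract the data from this tax document.\n- If a value is not present, provide null.",
    "# Income Tax and Benefit Return\n...",  # placeholder for the document Markdown
)
print([message["role"] for message in messages])
```

Keeping the document-specific instruction separate from the document content makes it easier to reuse the same helper for other document types.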

## Objectives

By the end of this sample, you will have learned how to:

- Convert a document to Markdown format using Azure AI Document Intelligence.
- Use prompt engineering techniques to instruct GPT-4o to extract structured data from a type of document.
- Use the [Structured Outputs feature](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/structured-outputs?tabs=python-secure) to extract structured data from a document using Azure OpenAI's GPT-4o model.
- Use the analysis result from Azure AI Document Intelligence to determine the confidence of the extracted structured output.
- Use the [logprobs](https://learn.microsoft.com/en-us/azure/ai-services/openai/reference#request-body:~:text=False-,logprobs,-integer) parameter in an OpenAI request to determine the confidence of the extracted structured output.

## Setup

### Import modules

This sample takes advantage of the following Python dependencies:

- **azure-ai-documentintelligence** to interface with the Azure AI Document Intelligence API for analyzing documents.
- **openai** to interface with the Azure OpenAI chat completions API to generate structured extraction outputs using the GPT-4o model.
- **azure-identity** to securely authenticate with deployed Azure Services using Microsoft Entra ID credentials.

The following local modules are also used:

- **modules.app_settings** to access environment variables from the `.env` file.
- **modules.comparison** to compare the output of the extraction process with expected results.
- **modules.document_intelligence_confidence** to evaluate the confidence of the extraction process based on the extracted structured output and the analysis result from Azure AI Document Intelligence.
- **modules.document_processing_result** to store the results of the extraction process as a file.
- **modules.openai_confidence** to calculate the confidence of the extraction process based on the `logprobs` response from the API request.
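- **modules.invoice** to provide the expected structured output JSON schema for invoice documents.
- **modules.taxdoc** to provide the expected structured output JSON schema for tax documents.
- **modules.accuracy_evaluator** to determine the accuracy of the extracted data against the expected values.
- **modules.confidence** to merge the confidence values from Azure AI Document Intelligence and Azure OpenAI.
- **modules.utils** `Stopwatch` to measure the end-to-end execution time for the extraction process.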
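
The source of `modules.utils` is not shown in this sample. As a rough sketch only, a context manager that exposes an `elapsed` attribute in seconds, which is how `Stopwatch` is used later, could look like the following. The class body here is an assumption, not the module's actual implementation.

```python
import time


class Stopwatch:
    """Hypothetical timing context manager; `elapsed` holds seconds once the block exits."""

    def __init__(self) -> None:
        self.elapsed = 0.0
        self._start = 0.0

    def __enter__(self) -> "Stopwatch":
        # perf_counter is monotonic, so it suits measuring short durations.
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        self.elapsed = time.perf_counter() - self._start


with Stopwatch() as sw:
    sum(range(1000))  # stand-in for a service call

print(f"{sw.elapsed:.6f} seconds")
```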

In [17]:
import sys
sys.path.append('../../') # Import local modules

from IPython.display import display, Markdown
import os
import pandas as pd
import json
from dotenv import dotenv_values
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeResult, ContentFormat
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

from modules.app_settings import AppSettings
from modules.utils import Stopwatch
from modules.accuracy_evaluator import AccuracyEvaluator
from modules.comparison import get_extraction_comparison
from modules.confidence import merge_confidence_values
from modules.document_intelligence_confidence import evaluate_confidence as di_evaluate_confidence
from modules.openai_confidence import evaluate_confidence as oai_evaluate_confidence
from modules.invoice import Invoice
from modules.taxdoc import TaxDocument
from modules.document_processing_result import DataExtractionResult

### Configure the Azure services

The Azure AI Document Intelligence and Azure OpenAI SDKs are used to create client instances from a deployed endpoint and authentication credentials.

For this sample, the credentials of the Azure CLI are used to authenticate with the deployed services.

In [18]:
# Set the working directory to the root of the repo
working_dir = os.path.abspath('../../../')
settings = AppSettings(dotenv_values(f"{working_dir}/.env"))

# Configure the default credential for accessing Azure services using Azure CLI credentials
credential = DefaultAzureCredential(
    exclude_workload_identity_credential=True,
    exclude_developer_cli_credential=True,
    exclude_environment_credential=True,
    exclude_managed_identity_credential=True,
    exclude_powershell_credential=True,
    exclude_shared_token_cache_credential=True,
    exclude_interactive_browser_credential=True
)

openai_token_provider = get_bearer_token_provider(credential, 'https://cognitiveservices.azure.com/.default')

openai_client = AzureOpenAI(
    azure_endpoint=settings.openai_endpoint,
    azure_ad_token_provider=openai_token_provider,
    api_version="2024-10-01-preview" # Requires the latest API version for structured outputs.
)

document_intelligence_client = DocumentIntelligenceClient(
    endpoint=settings.ai_services_endpoint,
    credential=credential
)

### Establish the expected output

To measure the accuracy of the extraction process, the expected output for each document is defined in a JSON metadata file. The structure below shows the metadata for an [Invoice](../../assets/invoices/invoice_1.pdf); the code cell that follows it loads the equivalent metadata for a tax form.

> **Note**: More invoice examples can be found in the [assets folder](../../assets/invoices). These examples include the PDF file and an associated JSON metadata file that provides the expected structured output. You can add your own scenarios by following the same structure.

```json
{
    "fname": "<name of the invoice file>",
    "expected": {
        "customer_name": "",
        "customer_address": {
            "street": "",
            "city": "",
            "state": "",
            "postal_code": "",
            "country": ""
        },
        "customer_tax_id": "",
        "shipping_address": "",
        "purchase_order": "",
        "invoice_id": "",
        "invoice_date": "",
        "payable_by": "",
        "vendor_name": "",
        "vendor_address": "",
        "vendor_tax_id": "",
        "remittance_address": "",
        "subtotal": 0,
        "total_discount": 0,
        "total_tax": 0,
        "invoice_total": 0,
        "payment_terms": "",
        "items": [
            {
                "product_code": "",
                "description": "",
                "quantity": 0,
                "tax": 0,
                "tax_rate": "",
                "unit_price": 0,
                "total": 0,
                "reason": null
            }
        ],
        "total_item_quantity": 0,
        "items_customer_signature": {
            "signatory": "",
            "is_signed": true
        },
        "items_vendor_signature": {
            "signatory": "",
            "is_signed": true
        },
        "returns": [
            {
                "product_code": "",
                "description": "",
                "quantity": 0,
                "tax": null,
                "tax_rate": null,
                "unit_price": null,
                "total": null,
                "reason": ""
            }
        ],
        "total_return_quantity": 0,
        "returns_customer_signature": {
            "signatory": "",
            "is_signed": true
        },
        "returns_vendor_signature": {
            "signatory": "",
            "is_signed": true
        }
    }
}
```

The expected output has been defined by a human evaluating the document.
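
When adding your own scenario, it can help to check that a metadata file follows the structure above before running the extraction. The sketch below is a minimal, hypothetical validator: `load_metadata` is not part of this sample, and the file name and field values in the usage line are made up for illustration.

```python
import json
from typing import Any, Dict, Tuple


def load_metadata(text: str) -> Tuple[str, Dict[str, Any]]:
    """Parse a metadata JSON string and return (fname, expected)."""
    data = json.loads(text)
    missing = [key for key in ("fname", "expected") if key not in data]
    if missing:
        raise ValueError(f"Metadata is missing required keys: {missing}")
    if not isinstance(data["expected"], dict):
        raise ValueError("'expected' must be a JSON object")
    return data["fname"], data["expected"]


# Illustrative values only; a real file would contain the full expected output.
fname, expected_fields = load_metadata(
    '{"fname": "example.pdf", "expected": {"invoice_id": "INV-001"}}'
)
print(fname, expected_fields)
```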

In [19]:
path = f"{working_dir}/samples/assets/taxforms/"
metadata_fname = "taxform12.json" # Change this to the file you want to evaluate
metadata_fpath = f"{path}{metadata_fname}"

with open(metadata_fpath, "r") as f:
    data = json.load(f)
    
expected = TaxDocument(**data['expected'])
pdf_fname = data['fname']
pdf_fpath = f"{path}{pdf_fname}"

tax_evaluator = AccuracyEvaluator(match_keys=['first_name', 'last_name'])

## Extract data from the document

The following code block executes the data extraction process using Azure AI Document Intelligence and Azure OpenAI's GPT-4o model.

It performs the following steps:

1. Get the document bytes from the provided file path. _Note: This example processes a local document; however, you can use any document storage location of your choice, such as Azure Blob Storage._
2. Use Azure AI Document Intelligence to analyze the structure of the document and convert it to Markdown format using the pre-built layout model.
3. Using Azure OpenAI's GPT-4o model and its [Structured Outputs feature](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/structured-outputs?tabs=python-secure), extract a structured data transfer object (DTO) from the content of the Markdown.

In [20]:
with Stopwatch() as di_stopwatch:
    with open(pdf_fpath, "rb") as f:
        poller = document_intelligence_client.begin_analyze_document(
            "prebuilt-layout",
            analyze_request=f,
            output_content_format=ContentFormat.MARKDOWN,
            content_type="application/pdf"
        )
    
    result: AnalyzeResult = poller.result()

markdown = result.content

In [21]:
with Stopwatch() as oai_stopwatch:
    completion = openai_client.beta.chat.completions.parse(
        model=settings.gpt4o_model_deployment_name,
        messages=[
            {
                "role": "system",
                "content": "You are an AI assistant that extracts data from documents.",
            },
            {
                "role": "user",
                "content": f"""Extract the data from this tax document. 
                - If a value is not present, provide null.
                - Dates should be in the format YYYY-MM-DD.""",
            },
            {
                "role": "user",
                "content": markdown,
            }
        ],
        response_format=TaxDocument,
        max_tokens=4096,
        temperature=0.1,
        top_p=0.1,
        logprobs=True # Enabled to determine the confidence of the response.
    )

### Understanding the Structured Outputs JSON schema

Using [Pydantic's JSON schema feature](https://docs.pydantic.dev/latest/concepts/json_schema/), the data model passed to the `response_format` parameter of the OpenAI chat completions request is automatically converted to a JSON schema. This run uses the `TaxDocument` model; the [Invoice](../../modules/invoice.py) model works the same way.

The JSON schema is used to instruct the GPT-4o model to generate a strict output that adheres to the structure defined. The approach using Pydantic makes it easier for developers to manage the data structure in code, with helpful descriptions and examples that will be included in the final JSON schema.
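
As a rough illustration of the conversion described above, the sketch below defines a small hypothetical model with nullable, described fields and prints the schema fragment that Pydantic generates for one of them. `ExampleAddress` is not part of this sample, and the code assumes Pydantic v2, where `model_json_schema()` is available.

```python
from typing import Optional

from pydantic import BaseModel, Field


class ExampleAddress(BaseModel):
    """Hypothetical model for illustration; not the sample's TaxDocument."""

    # Optional[...] without a default is required but nullable in Pydantic v2,
    # which lets the model return null when a value is not present.
    street: Optional[str] = Field(description="Street address, e.g. 345 Test Street")
    city: Optional[str] = Field(description="City, e.g. Waterloo")


schema = ExampleAddress.model_json_schema()
print(schema["properties"]["street"])
```

The field descriptions travel with the schema, so they act as per-field extraction hints in addition to the prompt.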

Demonstrated below, you can see how the `TaxDocument` data model is presented to the OpenAI request:

In [22]:
# Highlight the schema sent to the OpenAI model
print(json.dumps(TaxDocument.model_json_schema(), indent=2))

{
  "$defs": {
    "TaxAddress": {
      "description": "A class representing an address in a tax document.\nAttributes:\nstreet: Street address\ncity: City, e.g. Waterloo\nprovince: Province, e.g. ON\npostal_code: Postal code, e.g. N2T 2M5",
      "properties": {
        "street": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "description": "Street address, e.g. 345 Test Street",
          "title": "Street"
        },
        "city": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "description": "City, e.g. Waterloo",
          "title": "City"
        },
        "province": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "description": "Provinc

## Visualize the outputs

To provide context for the execution of the code, the following code blocks visualize the outputs of the data extraction process.

This includes:

- The Markdown representation of the document structure as determined by Azure AI Document Intelligence.
- The accuracy of the structured data extraction comparing the expected output with the output generated by Azure OpenAI's GPT-4o model.
- The confidence score of the structured data extraction, derived from the Azure AI Document Intelligence analysis and the Azure OpenAI `logprobs` response.
- The execution time of the end-to-end process.
- The total number of tokens consumed by the GPT-4o model.
- The side-by-side comparison of the expected output and the output generated by Azure OpenAI's GPT-4o model.

### Understanding Accuracy vs Confidence

When using AI to extract structured data, both confidence and accuracy are essential for different but complementary reasons.

- **Accuracy** measures how close the AI model's output is to a ground truth or expected output. It reflects how well the model's predictions align with reality.
  - Accuracy ensures consistency in the extraction process, which is crucial for downstream tasks using the data.
- **Confidence** represents the AI model's internal assessment of how certain it is about its predictions.
  - A low confidence score is a useful signal for human reviewers to step in for manual verification.

High accuracy and high confidence are ideal, but the two do not always move together: a model can be confidently wrong. Because accuracy can only be measured when a ground truth is available, confidence scores can and should be used to prioritize manual verification of low-confidence predictions.
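
To make the distinction concrete, the sketch below computes both measures on toy data. It is a simplified illustration under stated assumptions, not the logic of `AccuracyEvaluator` or `modules.openai_confidence`: accuracy is taken as the fraction of exactly matching fields, confidence as the mean token probability (`exp(logprob)`), and the `0.9` review threshold is arbitrary.

```python
import math
from typing import Any, Dict, List


def field_accuracy(expected: Dict[str, Any], actual: Dict[str, Any]) -> float:
    """Fraction of expected fields whose extracted value matches exactly."""
    if not expected:
        return 0.0
    matches = sum(1 for key, value in expected.items() if actual.get(key) == value)
    return matches / len(expected)


def logprob_confidence(token_logprobs: List[float]) -> float:
    """Mean per-token probability, where probability = exp(logprob)."""
    if not token_logprobs:
        return 0.0
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)


def needs_review(confidence: Dict[str, float], threshold: float = 0.9) -> List[str]:
    """Names of fields whose confidence falls below the threshold."""
    return sorted(name for name, value in confidence.items() if value < threshold)


expected = {"first_name": "John", "last_name": "Doe", "address_city": "Toronto"}
actual = {"first_name": "John", "last_name": "Doe", "address_city": "Waterloo"}

print(f"accuracy: {field_accuracy(expected, actual):.2f}")
print(f"confidence: {logprob_confidence([0.0, math.log(0.5)]):.2f}")
print(needs_review({"first_name": 0.98, "address_city": 0.62}))
```

Accuracy needs the `expected` values, so it is only available during evaluation; the confidence functions need nothing but the model response, which is why they can drive review in production.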

In [23]:
# Displays the output of the Azure AI Document Intelligence pre-built layout analysis in Markdown format.
display(Markdown(markdown))

<!-- PageHeader="Clear Data" -->


<figure>

Canada Revenue
Agency

Agence du revenu
du Canada

</figure>


# Income Tax and Benefit Return

<!-- PageNumber="T1 2023" -->

Protected B when completed

If this return is for a deceased person, enter their information on this page. For more information, see Guide T4011,
Preparing Returns for Deceased Persons.

Attach to your paper return only the documents that are requested to support your deduction, claim or expense. Keep all other
documents in case the Canada Revenue Agency (CRA) asks to see them later.


## Step 1 - Identification and other information

8


<table>
<tr>
<th>Identification</th>
<th rowspan="2">Last name</th>
<th rowspan="2">Social insurance number (SIN)</th>
<th rowspan="2">Marital status on December 31, 2023:</th>
</tr>
<tr>
<th>First name</th>
</tr>
<tr>
<td>John</td>
<td>Doe</td>
<td>999555123</td>
<td>1 ☒ ✔ Married</td>
</tr>
<tr>
<td colspan="2">Mailing address (apartment - number, street) 123 Fake Street</td>
<td>Date of birth</td>
<td>2 ☐ Living common-law</td>
</tr>
<tr>
<td>PO Box</td>
<td>RR</td>
<td>(Year Month Day) 19771005</td>
<td>3 ☐ Widowed</td>
</tr>
<tr>
<td>City</td>
<td>Prov./Terr. Postal code</td>
<td>If this return is for</td>
<td>4 ☐ Divorced</td>
</tr>
<tr>
<td>Waterloo</td>
<td>ON N2T2M5</td>
<td>a deceased person, enter the date of death</td>
<td>5 ☐ Separated</td>
</tr>
<tr>
<td>Email address</td>
<td></td>
<td>(Year Month Day)</td>
<td></td>
</tr>
<tr>
<td>john.doe@noreply.com</td>
<td></td>
<td></td>
<td>6 ☐ Single</td>
</tr>
</table>


By providing an email address, you are registering to receive
email notifications from the CRA and agree to the Terms of
use. To view the Terms of use, go to canada.ca/cra-email
-notifications-terms.

Your language of correspondence:
☒
English
Votre langue de correspondance :
☐
✔
Français


### Residence information

Your province or territory of residence on December 31, 2023:
Ontario

Your current province or territory of residence if it is different
than your mailing address above:

Ontario

Province or territory where your business had a permanent
establishment if you were self-employed in 2023:

Ontario


<table>
<tr>
<th>If you became a resident of Canada in 2023 for income tax purposes,</th>
<th>(Month Day)</th>
</tr>
<tr>
<th>enter your date of entry:</th>
<th></th>
</tr>
<tr>
<td rowspan="2">If you ceased to be a resident of Canada in 2023 for income tax purposes, enter your date of departure:</td>
<td>(Month Day)</td>
</tr>
<tr>
<td></td>
</tr>
</table>


Your spouse's or common-law partner's information

Their first name
Jane

Their SIN
999555124


<table>
<tr>
<td>Tick this box if they were self-employed in 2023.</td>
<td>1 ☒ ✔</td>
<td></td>
</tr>
<tr>
<td>Net income from line 23600 of their return to claim certain credits (or the amount that it would be if they filed a return, even if the amount is "0")</td>
<td>150,000,00</td>
<td></td>
</tr>
<tr>
<td>Amount of universal child care benefit (UCCB) from line 11700 of their return</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Amount of UCCB repayment from line 21300 of their return</td>
<td></td>
<td></td>
</tr>
</table>


Do not use this area.

PDF Merger


<figure>

<!-- PageNumber="Page 1 of 8" -->

</figure>


<table>
<tr>
<td>Do not use</td>
<td>17100</td>
<td></td>
</tr>
<tr>
<td>this area.</td>
<td>17200</td>
<td></td>
</tr>
</table>


<!-- PageFooter="5006-R E (23)" -->
<!-- PageFooter="(Ce formulaire est disponible en français.)" -->
<!-- PageBreak -->

<!-- PageHeader="Clear Data" -->

Protected B when completed


#### Step 1 - Identification and other information (continued)


<figure>

Elections Canada

</figure>


Elections Canada

For more information, go to canada.ca/cra-elections-canada.

A) Do you have Canadian citizenship?
If yes, go to question B. If no, skip question B.
☒
✔

1
Yes 2
☐
No

B) As a Canadian citizen, do you authorize the CRA to give your name, address, date of birth
and citizenship to Elections Canada to update the National Register of Electors or, if you are
14 to 17 years of age, the Register of Future Electors?
1
☐
Yes 2
☒
No
✔

Your authorization is valid until you file your next tax return. Your information will only be used for purposes permitted
under the Canada Elections Act, which include sharing lists of electors produced from the National Register of Electors
with provincial and territorial electoral agencies, members of Parliament, registered and eligible political parties, and
candidates at election time.

Your information in the Register of Future Electors will be included in the National Register of Electors once you turn 18
and your eligibility to vote is confirmed. Information from the Register of Future Electors can be shared only with provincial
and territorial electoral agencies that are allowed to collect future elector information. In addition, Elections Canada can use
information in the Register of Future Electors to provide youth with educational information about the electoral process.

Indian Act - Exempt income

Tick this box if you have income that is exempt under the Indian Act.
For more information about this type of income, go to canada.ca/taxes-indigenous-peoples.
☐
1

If you ticked the box above, complete Form T90, Income Exempt from Tax under the Indian Act, so that the CRA can
calculate your Canada workers benefit for the 2023 tax year, if applicable, and your family's provincial or territorial benefits.
The information you provide on Form T90 will also be used to calculate your Canada training credit limit for the 2024 tax year.


##### Climate action incentive payment

Tick this box if you reside outside of the census metropolitan areas (CMA) of Barrie, Belleville,
Brantford, Greater Sudbury, Guelph, Hamilton, Kingston, Kitchener-Cambridge-Waterloo,
London, Oshawa, the Ontario part of Ottawa-Gatineau, Peterborough, St. Catharines-Niagara,
Thunder Bay, Toronto or Windsor, as determined by Statistics Canada (2016), and expect to
continue to reside outside the same CMA on April 1, 2024.
1

☒
✔

Note: If your marital status is married or living common-law, and both you and your spouse or common-law partner were
residing in the same location outside of a CMA, you must tick this box on both of your returns.


##### Foreign property

Did you own or hold specified foreign property where the total cost amount of all such property,
at any time in 2023, was more than CAN$100,000?
26600
1
☐
Yes 2
☒
No
If yes, complete Form T1135, Foreign Income Verification Statement. There are substantial penalties for not filing
Form T1135 by the due date. For more information, see Form T1135.
✔


##### Consent to share contact information - Organ and tissue donor registry

I authorize the CRA to provide my name and email address to Ontario Health so that Ontario Health (Trillium
Gift of Life) may contact or send information to me by email about organ and tissue donation.
For more information about organ and tissue donation in Canada, go to canada.ca/organ-tissue-donation.
1
☒
YEDF Merge No

Note: You are not consenting to organ and tissue donation when you authorize the CRA to share your contact information
with Ontario Health. Your authorization is only valid for the tax year for which you are filing this tax return. Your
information will only be collected under the Ontario Gift of Life Act.


<figure>

<!-- PageNumber="Page 2 of 8" -->

</figure>


<!-- PageFooter="5006-R E (23)" -->


In [24]:
# Gets the parsed TaxDocument object from the completion response.
invoice = completion.choices[0].message.parsed

expected_dict = expected.to_dict()
invoice_dict = invoice.to_dict()

In [25]:
# Determines the accuracy of the extracted data against the expected values.
accuracy = tax_evaluator.evaluate(expected=expected_dict, actual=invoice_dict)

In [26]:
# Determines the confidence of the extracted data using both the OpenAI and Azure Document Intelligence responses.
di_confidence = di_evaluate_confidence(invoice_dict, result)
oai_confidence = oai_evaluate_confidence(invoice_dict, completion.choices[0])

confidence = merge_confidence_values(di_confidence, oai_confidence)

In [27]:
# Gets the total execution time of the data extraction process.
total_elapsed = di_stopwatch.elapsed + oai_stopwatch.elapsed

# Gets the prompt tokens and completion tokens from the completion response.
prompt_tokens = completion.usage.prompt_tokens
completion_tokens = completion.usage.completion_tokens

In [28]:
# Save the output of the data extraction result.
extraction_result = DataExtractionResult(invoice_dict, confidence, accuracy, prompt_tokens, completion_tokens, total_elapsed)

with open(f"{working_dir}/samples/extraction/text-based/document-intelligence-openai.{pdf_fname}.json", "w") as f:
    f.write(extraction_result.to_json(indent=4))

In [29]:
# Display the outputs of the data extraction process.
df = pd.DataFrame([
    {
        "Accuracy": f"{accuracy['overall'] * 100:.2f}%",
        "Confidence": f"{confidence['_overall'] * 100:.2f}%",
        "Execution Time": f"{total_elapsed:.2f} seconds",
        "Document Intelligence Execution Time": f"{di_stopwatch.elapsed:.2f} seconds",
        "OpenAI Execution Time": f"{oai_stopwatch.elapsed:.2f} seconds",
        "Prompt Tokens": prompt_tokens,
        "Completion Tokens": completion_tokens
    }
])

display(df)
display(get_extraction_comparison(expected_dict, invoice_dict, confidence, accuracy['accuracy']))

| | Accuracy | Confidence | Execution Time | Document Intelligence Execution Time | OpenAI Execution Time | Prompt Tokens | Completion Tokens |
|---|---|---|---|---|---|---|---|
| 0 | 37.50% | 98.17% | 7.70 seconds | 4.78 seconds | 2.92 seconds | 2925 | 116 |


| | Field | Expected | Extracted | Confidence | Accuracy |
|---|---|---|---|---|---|
| 0 | address_city | Toronto | Waterloo | 99.50% | Mismatch |
| 1 | address_postal_code | A1B 2C3 | N2T2M5 | 98.34% | Mismatch |
| 2 | address_province | Ontario | ON | 99.40% | Mismatch |
| 3 | address_street | Main St | 123 Fake Street | 99.20% | Mismatch |
| 4 | business_province | | Ontario | 99.50% | Mismatch |
| 5 | current_province | | Ontario | 99.50% | Mismatch |
| 6 | date_of_birth | 1980-01-01 | 1977-10-05 | 100.00% | Mismatch |
| 7 | first_name | John | John | 98.40% | Match |
| 8 | language_of_correspondence | English | English | 99.60% | Match |
| 9 | last_name | Doe | Doe | 99.40% | Match |