# Using Azure AI Document Intelligence and Azure OpenAI GPT-3.5 Turbo to extract structured data from documents

This notebook demonstrates [how to use the new Markdown content extraction feature of Azure AI Document Intelligence's pre-built Layout model](https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/document-intelligence-preview-adds-more-prebuilts-support-for/ba-p/4084608) to convert documents, such as invoices, into Markdown, then use GPT-3.5 Turbo to extract structured JSON data using the [Azure OpenAI Service](https://learn.microsoft.com/en-us/azure/ai-services/openai/overview).

## Prerequisites

The notebook uses [PowerShell](https://learn.microsoft.com/powershell/scripting/install/installing-powershell) and [Azure CLI](https://learn.microsoft.com/cli/azure/install-azure-cli) to deploy all necessary Azure resources. Both tools are available on Windows, macOS and Linux environments. It also uses [.NET 8](https://dotnet.microsoft.com/download/dotnet/8.0) to run the C# code that interacts with Azure AI Document Intelligence and Azure OpenAI Service.

Running this notebook will deploy the following resources in your Azure subscription:
- Azure Resource Group
- Azure AI Document Intelligence (East US)
- Azure OpenAI Service (East US)
- GPT-3.5 Turbo 16K model deployment (120K capacity)

**Note**: Any GPT-3.5 Turbo model can be used with this sample. To provide a single-region deployment for both Azure AI Document Intelligence and Azure OpenAI Service, this sample uses the GPT-3.5 Turbo 16K model in the East US region. For more information on the available GPT models, see the [Azure OpenAI Service documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models#gpt-35). For more information on the available regions for Azure AI Document Intelligence preview features, see the [Azure AI Document Intelligence documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/whats-new?view=doc-intel-4.0.0&tabs=csharp#february-2024).

## Deploy infrastructure with Az CLI & Bicep

The following cell prompts you to log in to Azure if you are not already logged in. Once logged in, your current default subscription is used for the deployment.

> **Note:** If you have multiple subscriptions, you can change the default subscription by running `az account set --subscription <subscription_id>`.

Then all the Azure resources listed above are deployed using [Azure Bicep](https://learn.microsoft.com/en-us/azure/azure-resource-manager/bicep/).

The deployment runs at the subscription level and creates a new resource group. The location defaults to **East US**. To use a different one, change the `-Location` parameter passed to the PowerShell script in the deployment cell below. Any location you choose must support both the [Azure AI Document Intelligence preview features](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/whats-new?view=doc-intel-4.0.0&tabs=csharp#february-2024) and a GPT-3.5 Turbo model deployment. The model details can also be adjusted in the [`main.bicep`](./infra/main.bicep) file.

Once deployed, the endpoints and API keys will be stored in the [`./config.env`](./config.env) file for use in the .NET code.

### Understanding the deployment

#### AI Document Intelligence

An [Azure AI Document Intelligence](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/overview) instance is deployed in the East US region. This is to enable support for the new Markdown content extraction feature of the pre-built Layout model.

#### OpenAI Services

An [Azure OpenAI Service](https://learn.microsoft.com/en-us/azure/ai-services/openai/overview) instance is deployed in the East US region. This is deployed with the `gpt-35-turbo-16k` model to be used for inference.

In [None]:
# Login to Azure
Write-Host "Checking if logged in to Azure..."

$loggedIn = az account show --query "name" -o tsv

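# Put $null on the left so the comparison behaves correctly even if the value is a collection
if ($null -ne $loggedIn) {
    Write-Host "Already logged in. Current subscription: $loggedIn"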
} else {
    Write-Host "Logging in..."
    az login
}

# Retrieve the default subscription ID
$subscriptionId = (
    (
        az account list -o json `
            --query "[?isDefault]"
    ) | ConvertFrom-Json
).id

# Set the subscription
az account set --subscription $subscriptionId
Write-Host "Subscription set to $subscriptionId"

In [None]:
# Run Deploy-Infrastructure.ps1
.\Deploy-Infrastructure.ps1 -DeploymentName 'docintel-gpt-document-extraction' -Location 'eastus'

## Install .NET dependencies

This notebook uses .NET to interact with the Azure AI Document Intelligence and Azure OpenAI Service. It takes advantage of the following NuGet packages:

### Azure.AI.DocumentIntelligence

The [Azure.AI.DocumentIntelligence](https://github.com/Azure/azure-sdk-for-net/tree/main/sdk/documentintelligence/Azure.AI.DocumentIntelligence) library provides a client for the Azure AI Document Intelligence service, used here to analyze documents and extract information from them.

### Azure.AI.OpenAI

The [Azure.AI.OpenAI](https://github.com/Azure/azure-sdk-for-net/tree/main/sdk/openai/Azure.AI.OpenAI) library provides a client for the Azure OpenAI Service, used here to run inference against the GPT-3.5 Turbo model.

### DotNetEnv

The [DotNetEnv](https://github.com/tonerdo/dotnet-env) library loads environment variables from a `.env` file so they can be read with the `Environment.GetEnvironmentVariable(string)` method. Here it loads the Azure OpenAI Service endpoint, key, and model deployment name, along with the Azure AI Document Intelligence endpoint and key, from the [`./config.env`](./config.env) file.
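
Before constructing the service clients, it can help to confirm that every expected setting was actually loaded. The following is a minimal sketch, not part of the original sample: the variable names are the ones read later in this notebook, while `FindMissingVariables` is a hypothetical helper introduced only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Variable names as read by the configuration cell later in this notebook.
string[] requiredVariables =
{
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_MODEL_DEPLOYMENT_NAME",
    "AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT",
    "AZURE_DOCUMENT_INTELLIGENCE_KEY"
};

// Hypothetical helper: returns the names whose values are unset or blank.
List<string> FindMissingVariables(IEnumerable<string> names) =>
    names.Where(name => string.IsNullOrWhiteSpace(Environment.GetEnvironmentVariable(name)))
         .ToList();

var missing = FindMissingVariables(requiredVariables);
if (missing.Count > 0)
{
    Console.WriteLine($"Missing settings: {string.Join(", ", missing)}");
}
```

Run a check like this after `Env.Load("config.env")`. An empty result suggests the deployment script wrote all five values.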

In [None]:
#r "nuget:System.Text.Json, 8.0.1"
#r "nuget:DotNetEnv, 3.0.0"
#r "nuget:Azure.AI.OpenAI, 1.0.0-beta.14"
#r "nuget:Azure.AI.DocumentIntelligence, 1.0.0-beta.2"

In [5]:
using System.Net;
using System.Net.Http;
using System.Text.Json.Nodes;
using System.Text.Json;
using System.IO; 

using Azure;
using Azure.AI.DocumentIntelligence;
using Azure.AI.OpenAI;
using DotNetEnv;

In [7]:
Env.Load("config.env");

var openAIEndpoint = Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT");
var openAIApiKey = Environment.GetEnvironmentVariable("AZURE_OPENAI_API_KEY");
var openAIModelDeployment = Environment.GetEnvironmentVariable("AZURE_OPENAI_MODEL_DEPLOYMENT_NAME");
var openAIApiVersion = "2023-12-01-preview";
var documentIntelligenceEndpoint = Environment.GetEnvironmentVariable("AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT");
var documentIntelligenceApiKey = Environment.GetEnvironmentVariable("AZURE_DOCUMENT_INTELLIGENCE_KEY");

var pdfName = "Invoice_1.pdf";

var documentIntelligenceClient = new DocumentIntelligenceClient(new Uri(documentIntelligenceEndpoint), new AzureKeyCredential(documentIntelligenceApiKey));
var openAIClient = new OpenAIClient(new Uri(openAIEndpoint), new AzureKeyCredential(openAIApiKey));

## Perform layout analysis on a document to extract Markdown content

To be able to extract structured JSON data from a document, the document must first be converted into Markdown content. The following code demonstrates how to use the Azure AI Document Intelligence SDK to perform layout analysis on a document and return the result as Markdown.

### Important notes for document analysis with Azure AI Document Intelligence

- The document must be in one of the supported formats: PDF, JPEG/JPG, PNG, BMP, TIFF, HEIF, DOCX, XLSX, PPTX, or HTML.
- The document file size limit is 500MB.
- To process only specific pages of a document, pass the page numbers or ranges in the `pages` parameter of the `AnalyzeDocumentAsync` method, for example `1,3-5,7-10`.
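
The format and size constraints above can be checked locally before a document is sent to the service. This is a rough sketch under stated assumptions: `IsSupportedDocument` is a hypothetical helper, the extension list is inferred from the formats named above (with `.tif` added as a common TIFF extension), and 500MB is interpreted as 500 × 1024 × 1024 bytes. Check the current service documentation for the authoritative limits.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical pre-flight check; returns false with a reason when the file is unlikely to be accepted.
bool IsSupportedDocument(string fileName, long sizeInBytes, out string reason)
{
    // Extensions inferred from the supported formats listed above (".tif" is an assumption).
    var supportedExtensions = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    {
        ".pdf", ".jpeg", ".jpg", ".png", ".bmp", ".tiff", ".tif",
        ".heif", ".docx", ".xlsx", ".pptx", ".html"
    };

    // Assumption: the 500MB limit is treated as 500 * 1024 * 1024 bytes.
    const long maxSizeInBytes = 500L * 1024 * 1024;

    var extension = Path.GetExtension(fileName);
    if (!supportedExtensions.Contains(extension))
    {
        reason = $"Unsupported file type '{extension}'.";
        return false;
    }

    if (sizeInBytes > maxSizeInBytes)
    {
        reason = $"File is {sizeInBytes} bytes; the limit is {maxSizeInBytes} bytes.";
        return false;
    }

    reason = string.Empty;
    return true;
}

// Usage: check the sample invoice if it exists in the working directory.
var candidate = "Invoice_1.pdf";
if (File.Exists(candidate))
{
    var accepted = IsSupportedDocument(candidate, new FileInfo(candidate).Length, out var why);
    Console.WriteLine(accepted ? "Document looks acceptable." : why);
}
```

A local check like this only catches obvious problems early. The service remains the final authority on whether a document can be analyzed.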

In [8]:
var markdownAnalysisContent = new AnalyzeDocumentContent()
{
    Base64Source = BinaryData.FromBytes(File.ReadAllBytes(pdfName))
};

Operation<AnalyzeResult> markdownAnalysisOperation = await documentIntelligenceClient.AnalyzeDocumentAsync(WaitUntil.Completed, "prebuilt-layout", markdownAnalysisContent, outputContentFormat: ContentFormat.Markdown);
var markdown = markdownAnalysisOperation.Value.Content;

## Use GPT-3.5 Turbo to extract structured JSON data from the Markdown content

Now that the document has been converted to Markdown, the GPT-3.5 Turbo model can be used to extract structured JSON data from the content. The following code uses the .NET SDK to call the deployed Azure OpenAI Service and extract the structured JSON data.

In this example, the payload object contains the following details:

### System Prompt

The system prompt is the instruction that prescribes the model's behavior. It lets you constrain the model to a specific task, such as extracting structured JSON data from documents.

In this case, it is to extract structured JSON data from the content. Here is what we have provided:

**You are an AI assistant that extracts data from documents and returns them as structured JSON objects. Do not return as a code block.**

Learn more about [system prompts](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/system-message).
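
A model may not always follow the instruction to avoid code blocks, so it can be worth stripping a surrounding Markdown fence before parsing the response. The sketch below is one defensive approach, not something the original sample does. `StripCodeFence` is a hypothetical helper, and it builds the three-backtick fence with `new string` so the example itself stays readable inside a fenced block.

```csharp
using System;

// Hypothetical helper (not part of any SDK): removes a surrounding Markdown
// code fence, with or without a language tag, and returns the inner text.
string StripCodeFence(string text)
{
    var fence = new string('`', 3);
    var trimmed = text.Trim();
    if (!trimmed.StartsWith(fence, StringComparison.Ordinal))
    {
        return trimmed; // no fence, nothing to remove
    }

    // Drop the opening fence line, including any language tag such as "json".
    var firstLineEnd = trimmed.IndexOf('\n');
    if (firstLineEnd < 0)
    {
        return string.Empty; // only a fence line, no content
    }
    var body = trimmed.Substring(firstLineEnd + 1);

    // Drop the closing fence if the model included one.
    var closing = body.LastIndexOf(fence, StringComparison.Ordinal);
    if (closing >= 0)
    {
        body = body.Substring(0, closing);
    }
    return body.Trim();
}

var fenced = new string('`', 3) + "json\n{\"company_name\":\"CONTOSO\"}\n" + new string('`', 3);
Console.WriteLine(StripCodeFence(fenced)); // prints {"company_name":"CONTOSO"}
```

In this notebook, such a helper could be applied to `completion.Message.Content` before the text is handed to a JSON parser.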

### User Prompt

The user prompt is the input the model uses to generate its response.

In this case, it is the content of the document plus some additional text context to help the model understand the task. Here is what we have provided:

**Extract the data from this invoice. If a value is not present, provide null. Use the following structure: {"company_name":"","invoice_date":"","products":[{"id":"","unit_price":"","quantity":"","total":""}],"total_amount":"","signatures":[{"type":"","has_signature":"","signed_on":""}]}**

This is followed by the Markdown content extracted from the document, which the code below sends as a second user message.

> **Note:** In the user prompt, it helps to provide a structure for the JSON response. Without one, the model chooses its own structure, and responses may be inconsistent.

The prompt tells the model what the task is, and the Markdown content gives it the information it needs to extract the structured JSON data. This approach results in a response similar to the following:

```json
{
  "company_name": "CONTOSO",
  "invoice_date": "2/27/2024",
  "products": [
    {
      "id": "5-01-XX",
      "unit_price": "1.00",
      "quantity": "1.0",
      "total": "1.00"
    },
    {
      "id": "5-02-XX",
      "unit_price": "1.50",
      "quantity": "5.0",
      "total": "7.50"
    }
  ],
  "total_amount": "8.50",
  "signatures": [
    {
      "type": "Distributor",
      "has_signature": true,
      "signed_on": "2/27/2024"
    },
    {
      "type": "Customer",
      "has_signature": false,
      "signed_on": null
    }
  ]
}
```
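
A response in this shape can be sanity-checked before it is used downstream. The sketch below parses the JSON with `System.Text.Json.Nodes` and compares the sum of the line totals with `total_amount`. It is an illustrative check rather than part of the original sample: `ReadDecimal` and `LineTotalsMatchInvoiceTotal` are hypothetical helpers, and the code assumes amounts are plain decimal strings in invariant-culture format, as in the example above.

```csharp
using System;
using System.Globalization;
using System.Text.Json.Nodes;

// Sample response copied from the example above.
var sampleResponse = """
{
  "company_name": "CONTOSO",
  "invoice_date": "2/27/2024",
  "products": [
    { "id": "5-01-XX", "unit_price": "1.00", "quantity": "1.0", "total": "1.00" },
    { "id": "5-02-XX", "unit_price": "1.50", "quantity": "5.0", "total": "7.50" }
  ],
  "total_amount": "8.50",
  "signatures": [
    { "type": "Distributor", "has_signature": true, "signed_on": "2/27/2024" },
    { "type": "Customer", "has_signature": false, "signed_on": null }
  ]
}
""";

// Hypothetical helper: reads a node as a decimal, or returns null if it is missing or not numeric.
decimal? ReadDecimal(JsonNode node) =>
    decimal.TryParse(node?.ToString(), NumberStyles.Number, CultureInfo.InvariantCulture, out var value)
        ? value
        : (decimal?)null;

// Hypothetical helper: true when every line total parses and their sum equals total_amount.
bool LineTotalsMatchInvoiceTotal(string json)
{
    var root = JsonNode.Parse(json);
    var products = root?["products"] as JsonArray;
    if (products is null)
    {
        return false;
    }

    decimal sum = 0m;
    foreach (var product in products)
    {
        var lineTotal = ReadDecimal(product?["total"]);
        if (lineTotal is null)
        {
            return false;
        }
        sum += lineTotal.Value;
    }

    return ReadDecimal(root?["total_amount"]) == sum;
}

Console.WriteLine(LineTotalsMatchInvoiceTotal(sampleResponse)); // prints True
```

A mismatch does not necessarily mean the extraction failed, since an invoice may include tax or discounts, but it is a cheap signal that a result deserves a closer look.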

In [9]:
var jsonStructure = new {
    company_name = "",
    invoice_date = "",
    products = new [] {
        new {
            id = "",
            unit_price = "",
            quantity = "",
            total = ""
        }
    },
    total_amount = "",
    signatures = new [] {
        new {
            type = "",
            has_signature = "",
            signed_on = ""
        }
    }
};

ChatCompletionsOptions options = new ChatCompletionsOptions()
{
    DeploymentName = openAIModelDeployment,
    MaxTokens = 4096,
    Temperature = 0.1f,
    NucleusSamplingFactor = 0.1f
};

options.Messages.Add(new ChatRequestSystemMessage("You are an AI assistant that extracts data from documents and returns them as structured JSON objects. Do not return as a code block."));
options.Messages.Add(new ChatRequestUserMessage($"Extract the data from this invoice. If a value is not present, provide null. Use the following structure: {JsonSerializer.Serialize(jsonStructure)}"));
options.Messages.Add(new ChatRequestUserMessage(markdown));

Response<ChatCompletions> response = await openAIClient.GetChatCompletionsAsync(options);
foreach(var completion in response.Value.Choices)
{
    Console.WriteLine(completion.Message.Content);
}

{
  "company_name": "CONTOSO",
  "invoice_date": "2/27/2024",
  "products": [
    {
      "id": "5-01-XX",
      "unit_price": "1.00",
      "quantity": "1.0",
      "total": "1.00"
    },
    {
      "id": "5-02-XX",
      "unit_price": "1.50",
      "quantity": "5.0",
      "total": "7.50"
    },
    {
      "id": "5-03-XX",
      "unit_price": "5.75",
      "quantity": "2.0",
      "total": "11.50"
    },
    {
      "id": "5-04-XX",
      "unit_price": "2.80",
      "quantity": "6.0",
      "total": "16.80"
    },
    {
      "id": "5-05-XX",
      "unit_price": "4.45",
      "quantity": "13.0",
      "total": "57.85"
    },
    {
      "id": "5-06-XX",
      "unit_price": "2.20",
      "quantity": "11.0",
      "total": "24.20"
    },
    {
      "id": "5-07-XX",
      "unit_price": "20.05",
      "quantity": "5.0",
      "total": "100.25"
    },
    {
      "id": "5-08-XX",
      "unit_price": "9.50",
      "quantity": "1.0",
      "total": "9.50"
    },
    {
      "id": "5-09-X