# Extract Custom Fields from Your File

This notebook demonstrates how to use analyzers to extract custom fields from your input files.

Content Understanding provides **extensive prebuilt analyzers** ready to use without training. Always start with prebuilt analyzers before building custom solutions.

## Prerequisites
1. Ensure your Azure AI service is configured by following [these steps](../README.md#configure-azure-ai-service-resource).
2. Install the required packages to run the sample.

In [None]:
%pip install -r ../requirements.txt

## Create Azure AI Content Understanding Client

> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is a utility class containing functions to interact with the Content Understanding API. Until the official Content Understanding SDK is released, it can serve as a lightweight SDK. Set **AZURE_AI_ENDPOINT** and **AZURE_AI_API_KEY** (read from your `.env` file) and **API_VERSION** with the information from your Azure AI Service.

> ⚠️ Important:
You must update the code below to match your Azure authentication method.
Look for the `# IMPORTANT` comments and modify those sections accordingly.
If you skip this step, the sample may not run correctly.

> ⚠️ Note: Using a subscription key works, but using a token provider with Azure Active Directory (AAD) is much safer and is highly recommended for production environments.

In [None]:
from datetime import datetime
import logging
import json
import os
import sys
import asyncio
from dotenv import find_dotenv, load_dotenv

# Add the sibling 'python' directory to the Python path so the client and extension modules can be imported
sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'python'))
from content_understanding_client import AzureContentUnderstandingClient
from extension.document_processor import DocumentProcessor
from extension.sample_helper import save_json_to_file 
from azure.identity import DefaultAzureCredential

load_dotenv(find_dotenv())
logging.basicConfig(level=logging.INFO)

# For authentication, you can use either token-based auth or subscription key; only one is required
AZURE_AI_ENDPOINT = os.getenv("AZURE_AI_ENDPOINT")
# IMPORTANT: Replace with your actual subscription key or set it in your ".env" file if not using token authentication
AZURE_AI_API_KEY = os.getenv("AZURE_AI_API_KEY")
API_VERSION = "2025-11-01"

# Create token provider for Azure AD authentication
def token_provider():
    credential = DefaultAzureCredential()
    token = credential.get_token("https://cognitiveservices.azure.com/.default")
    return token.token

# Create the Content Understanding client
client = AzureContentUnderstandingClient(
    endpoint=AZURE_AI_ENDPOINT,
    api_version=API_VERSION,
    subscription_key=AZURE_AI_API_KEY,
    token_provider=token_provider if not AZURE_AI_API_KEY else None
)
print("✅ ContentUnderstandingClient created successfully")

try:
    processor = DocumentProcessor(client)
    print("✅ DocumentProcessor created successfully")
except Exception as e:
    print(f"❌ Failed to create DocumentProcessor: {e}")
    raise

## Configure Model Deployments for Prebuilt Analyzers

> **💡 Note:** This step is only required **once per Azure Content Understanding resource**, unless the GPT deployment has been changed. You can skip this section if:
> - This configuration has already been run once for your resource, or
> - Your administrator has already configured the model deployments for you

Before using prebuilt analyzers, you need to configure the default model deployment mappings. This tells Content Understanding which model deployments to use.

**Model Requirements:**
- **GPT-4.1** - Required for most prebuilt analyzers (e.g., `prebuilt-invoice`, `prebuilt-receipt`, `prebuilt-idDocument`)
- **GPT-4.1-mini** - Required for RAG analyzers (e.g., `prebuilt-documentSearch`, `prebuilt-audioSearch`, `prebuilt-videoSearch`)
- **text-embedding-3-large** - Required for all prebuilt analyzers that use embeddings

**Prerequisites:**
1. Deploy **GPT-4.1**, **GPT-4.1-mini**, and **text-embedding-3-large** models in Azure AI Foundry
2. Set `GPT_4_1_DEPLOYMENT`, `GPT_4_1_MINI_DEPLOYMENT`, and `TEXT_EMBEDDING_3_LARGE_DEPLOYMENT` in your `.env` file with the deployment names


In [None]:
# Get model deployment names from environment variables
GPT_4_1_DEPLOYMENT = os.getenv("GPT_4_1_DEPLOYMENT")
GPT_4_1_MINI_DEPLOYMENT = os.getenv("GPT_4_1_MINI_DEPLOYMENT")
TEXT_EMBEDDING_3_LARGE_DEPLOYMENT = os.getenv("TEXT_EMBEDDING_3_LARGE_DEPLOYMENT")

# Check if required deployments are configured
missing_deployments = []
if not GPT_4_1_DEPLOYMENT:
    missing_deployments.append("GPT_4_1_DEPLOYMENT")
if not GPT_4_1_MINI_DEPLOYMENT:
    missing_deployments.append("GPT_4_1_MINI_DEPLOYMENT")
if not TEXT_EMBEDDING_3_LARGE_DEPLOYMENT:
    missing_deployments.append("TEXT_EMBEDDING_3_LARGE_DEPLOYMENT")

if missing_deployments:
    print(f"⚠️  Warning: Missing required model deployment configuration(s):")
    for deployment in missing_deployments:
        print(f"   - {deployment}")
    print("\n   Prebuilt analyzers require GPT-4.1, GPT-4.1-mini, and text-embedding-3-large deployments.")
    print("   Please:")
    print("   1. Deploy all three models in Azure AI Foundry")
    print("   2. Add the following to notebooks/.env:")
    print("      GPT_4_1_DEPLOYMENT=<your-gpt-4.1-deployment-name>")
    print("      GPT_4_1_MINI_DEPLOYMENT=<your-gpt-4.1-mini-deployment-name>")
    print("      TEXT_EMBEDDING_3_LARGE_DEPLOYMENT=<your-text-embedding-3-large-deployment-name>")
    print("   3. Restart the kernel and run this cell again")
else:
    print(f"📋 Configuring default model deployments...")
    print(f"   GPT-4.1 deployment: {GPT_4_1_DEPLOYMENT}")
    print(f"   GPT-4.1-mini deployment: {GPT_4_1_MINI_DEPLOYMENT}")
    print(f"   text-embedding-3-large deployment: {TEXT_EMBEDDING_3_LARGE_DEPLOYMENT}")
    
    try:
        # Update defaults to map model names to your deployments
        result = client.update_defaults({
            "gpt-4.1": GPT_4_1_DEPLOYMENT,
            "gpt-4.1-mini": GPT_4_1_MINI_DEPLOYMENT,
            "text-embedding-3-large": TEXT_EMBEDDING_3_LARGE_DEPLOYMENT
        })
        
        print(f"✅ Default model deployments configured successfully")
        print(f"   Model mappings:")
        for model, deployment in result.get("modelDeployments", {}).items():
            print(f"     {model} → {deployment}")
    except Exception as e:
        print(f"❌ Failed to configure defaults: {e}")
        print(f"   This may happen if:")
        print(f"   - One or more deployment names don't exist in your Azure AI Foundry project")
        print(f"   - You don't have permission to update defaults")
        raise


# Part 1: Using Prebuilt Analyzers (Recommended Starting Point)

## Why Start with Prebuilt Analyzers?

Azure AI Content Understanding provides **70+ production-ready prebuilt analyzers** that cover common scenarios across finance, healthcare, legal, tax, and business domains. These analyzers are:

- **Immediately Available** - No training, configuration, or customization needed  
- **Battle-Tested** - Built on rich knowledge bases of thousands of real-world document examples  
- **Continuously Improved** - Regularly updated by Microsoft to handle document variations  
- **Cost-Effective** - Save development time and resources by using proven solutions  
- **Comprehensive Coverage** - Extensive support for Financial documents (invoices, receipts, bank statements, credit cards), Identity documents (passports, driver licenses, ID cards, health insurance), Tax documents (40+ US tax forms including 1040, W-2, 1099 variants), Mortgage documents (applications, appraisals, disclosures), Business documents (contracts, purchase orders, procurement), and many more specialized scenarios

> **Best Practice**: Always explore prebuilt analyzers first. Build custom analyzers only when prebuilt options don't meet your specific requirements.

### Complete List of Prebuilt Analyzer Categories

**Content Extraction & RAG**
- `prebuilt-read`, `prebuilt-layout` - OCR and layout analysis
- `prebuilt-documentSearch`, `prebuilt-imageSearch`, `prebuilt-audioSearch`, `prebuilt-videoSearch` - RAG-optimized

**Financial Documents**
- `prebuilt-invoice`, `prebuilt-receipt`, `prebuilt-creditCard`, `prebuilt-bankStatement.us`, `prebuilt-check.us`, `prebuilt-creditMemo`

**Identity & Healthcare**  
- `prebuilt-idDocument`, `prebuilt-idDocument.passport`, `prebuilt-healthInsuranceCard.us`

**Tax Documents (US)**
- 40+ tax form analyzers including `prebuilt-tax.us.1040`, `prebuilt-tax.us.w2`, all 1099 variants, 1098 series, and more

**Mortgage Documents (US)**
- `prebuilt-mortgage.us.1003`, `prebuilt-mortgage.us.1004`, `prebuilt-mortgage.us.1005`, `prebuilt-mortgage.us.closingDisclosure`

**Legal & Business**
- `prebuilt-contract`, `prebuilt-procurement`, `prebuilt-purchaseOrder`, `prebuilt-marriageCertificate.us`

**Other Specialized**
- `prebuilt-utilityBill`, `prebuilt-payStub.us`, and more

> **Learn More**: [Complete Prebuilt Analyzers Documentation](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/concepts/prebuilt-analyzers)
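
As a minimal sketch of how these IDs might be organized in code, the lookup below maps a few document kinds to analyzer IDs taken from the list above. The `PREBUILT_ANALYZERS` table, its key labels, and the `pick_analyzer` helper are hypothetical names used for illustration, and falling back to `prebuilt-layout` is an assumption rather than a service recommendation.

```python
# Hypothetical lookup table: the keys are illustrative labels, the values are
# analyzer IDs copied from the category list above.
PREBUILT_ANALYZERS = {
    "invoice": "prebuilt-invoice",
    "receipt": "prebuilt-receipt",
    "bank_statement": "prebuilt-bankStatement.us",
    "passport": "prebuilt-idDocument.passport",
    "w2": "prebuilt-tax.us.w2",
    "contract": "prebuilt-contract",
}


def pick_analyzer(document_kind: str, default: str = "prebuilt-layout") -> str:
    """Return the analyzer ID for a document kind, or a generic fallback."""
    return PREBUILT_ANALYZERS.get(document_kind.strip().lower(), default)


print(pick_analyzer("Invoice"))
print(pick_analyzer("unknown form"))
```

The returned ID could then be passed as `analyzer_id` to `client.begin_analyze_binary`, as the cells in this notebook do.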

## Build Custom Analyzers (When Needed)

Create custom analyzers only when prebuilt ones don't meet your needs:
- Extract fields specific to your business
- Process proprietary document types
- Customize extraction logic for unique requirements

**This notebook demonstrates both approaches:**
1. **Part 1**: Using prebuilt analyzers (receipts, invoices)
2. **Part 2**: Creating custom analyzers when prebuilt options aren't sufficient

## 1. Invoice Field Extraction with Prebuilt Analyzer

Let's demonstrate using `prebuilt-invoice` to extract structured data from an invoice PDF. This analyzer automatically identifies vendor information, invoice numbers, dates, line items, totals, taxes, and payment details without any configuration.


In [None]:
sample_file_path = '../data/invoice.pdf'
invoice_analyzer_id = "prebuilt-invoice"

print(f"🔍 Analyzing {sample_file_path} with {invoice_analyzer_id}...")

analysis_response = client.begin_analyze_binary(
    analyzer_id=invoice_analyzer_id,
    file_location=sample_file_path,
)

# Wait for analysis completion
print(f"⏳ Waiting for document analysis to complete...")
analysis_result = client.poll_result(analysis_response)
print(f"✅ Document analysis completed successfully!")


**Invoice Analysis Results**

Let's examine the extracted fields from the invoice:


In [None]:
# Display comprehensive results
if analysis_result and "result" in analysis_result:
    result = analysis_result["result"]
    contents = result.get("contents", [])
    
    if contents:
        first_content = contents[0]
        
        # Display extracted fields
        fields = first_content.get("fields", {})
        print("📊 Extracted Fields:")
        print("-" * 80)
        if fields:
            for field_name, field_value in fields.items():
                field_type = field_value.get("type")
                if field_type == "string":
                    print(f"{field_name}: {field_value.get('valueString')}")
                elif field_type == "number":
                    print(f"{field_name}: {field_value.get('valueNumber')}")
                elif field_type == "date":
                    print(f"{field_name}: {field_value.get('valueDate')}")
                elif field_type == "array":
                    print(f"{field_name} (array with {len(field_value.get('valueArray', []))} items):")
                    for idx, item in enumerate(field_value.get('valueArray', []), 1):
                        if item.get('type') == 'object':
                            print(f"  Item {idx}:")
                            for key, val in item.get('valueObject', {}).items():
                                if val.get('type') == 'string':
                                    print(f"    {key}: {val.get('valueString')}")
                                elif val.get('type') == 'number':
                                    print(f"    {key}: {val.get('valueNumber')}")
                                # Display confidence and source for nested fields
                                if val.get('confidence') is not None:
                                    print(f"      Confidence: {val.get('confidence'):.3f}")
                                if val.get('source'):
                                    print(f"      Bounding Box: {val.get('source')}")
                elif field_type == "object":
                    print(f"{field_name}: {field_value.get('valueObject')}")
                
                # Display confidence and bounding box for the field
                confidence = field_value.get('confidence')
                if confidence is not None:
                    print(f"  Confidence: {confidence:.3f}")
                source = field_value.get('source')
                if source:
                    print(f"  Bounding Box: {source}")
                print()
        else:
            print("No fields extracted")
        print()
        
        # Display content metadata
        print("📋 Content Metadata:")
        print("-" * 80)
        print(f"Kind: {first_content.get('kind')}")
        if first_content.get("kind") == "document":
            start_page = first_content.get("startPageNumber", 0)
            end_page = first_content.get("endPageNumber", 0)
            print(f"Pages: {start_page} - {end_page}")
            print(f"Total pages: {end_page - start_page + 1}")
        print()
    
    # Save full result to file
    saved_file_path = save_json_to_file(analysis_result, filename_prefix="prebuilt_invoice_analysis_result")
    print(f"💾 Full analysis result saved. Review the complete JSON at: {saved_file_path}")
else:
    print("No analysis result available")
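
The cell above prints a confidence score next to each field. One possible follow-up, sketched here under assumptions, is to list only the fields whose confidence falls below a review threshold. The `low_confidence_fields` helper, the `0.8` threshold, and the sample values are hypothetical; the dictionary shape mirrors `first_content["fields"]` as used above.

```python
def low_confidence_fields(fields: dict, threshold: float = 0.8) -> list:
    """Return (name, confidence) pairs for fields scored below the threshold."""
    flagged = []
    for name, value in fields.items():
        confidence = value.get("confidence")
        # Fields that report no confidence are skipped rather than flagged.
        if confidence is not None and confidence < threshold:
            flagged.append((name, confidence))
    # Lowest confidence first, so the weakest extractions are reviewed first.
    return sorted(flagged, key=lambda pair: pair[1])


# Made-up sample in the same shape as first_content["fields"].
sample_fields = {
    "VendorName": {"type": "string", "valueString": "Contoso", "confidence": 0.97},
    "InvoiceDate": {"type": "date", "valueDate": "2024-01-15", "confidence": 0.62},
    "Items": {"type": "array", "valueArray": []},  # no confidence reported
}
print(low_confidence_fields(sample_fields))
```

In the notebook, the same helper could be called with `first_content.get("fields", {})` once the analysis has completed.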


## 2. Receipt Field Extraction with Prebuilt Analyzer

Let's demonstrate using `prebuilt-receipt` to extract structured data from a receipt image. This analyzer automatically identifies merchant information, items, totals, taxes, and payment details without any configuration.


In [None]:
sample_file_path = '../data/receipt.png'
receipt_analyzer_id = "prebuilt-receipt"

print(f"🔍 Analyzing {sample_file_path} with {receipt_analyzer_id}...")

analysis_response = client.begin_analyze_binary(
    analyzer_id=receipt_analyzer_id,
    file_location=sample_file_path,
)

# Wait for analysis completion
print(f"⏳ Waiting for document analysis to complete...")
analysis_result = client.poll_result(analysis_response)
print(f"✅ Document analysis completed successfully!")


**Receipt Analysis Results**

Let's examine the extracted fields from the receipt:


In [None]:
# Save the analysis result to a file
saved_file_path = save_json_to_file(analysis_result, filename_prefix="prebuilt_receipt_analysis_result")
# Print the full analysis result as a JSON string
print(json.dumps(analysis_result, indent=2))
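
The raw JSON is verbose because every value is wrapped in a typed entry (`valueString`, `valueNumber`, `valueArray`, and so on). The sketch below shows one way to collapse those entries into plain Python values. The `to_plain` helper and the sample data are hypothetical, and only the value keys that appear elsewhere in this notebook are handled; other types fall through to `None`.

```python
# Scalar types seen in this notebook and the key that holds their value.
SCALAR_KEYS = {
    "string": "valueString",
    "number": "valueNumber",
    "date": "valueDate",
}


def to_plain(field: dict):
    """Recursively convert one typed field entry into a plain Python value."""
    field_type = field.get("type")
    if field_type == "array":
        return [to_plain(item) for item in field.get("valueArray", [])]
    if field_type == "object":
        return {key: to_plain(val) for key, val in field.get("valueObject", {}).items()}
    value_key = SCALAR_KEYS.get(field_type)
    return field.get(value_key) if value_key else None


# Made-up sample in the same shape as the "fields" object in the result.
sample_fields = {
    "MerchantName": {"type": "string", "valueString": "Contoso"},
    "Total": {"type": "number", "valueNumber": 12.5},
    "Items": {
        "type": "array",
        "valueArray": [
            {
                "type": "object",
                "valueObject": {
                    "Description": {"type": "string", "valueString": "Coffee"},
                    "Amount": {"type": "number", "valueNumber": 4.5},
                },
            }
        ],
    },
}
plain = {name: to_plain(value) for name, value in sample_fields.items()}
print(plain)
```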


# Part 2: Custom Analyzers

Now let's explore creating custom analyzers to extract specific fields tailored to your needs. Custom analyzers allow you to define exactly what information you want to extract and how it should be structured.


**Key Analyzer Configuration Components:**

- **`baseAnalyzerId`**: Specifies which prebuilt analyzer to inherit from. Available base analyzers:
  - **`prebuilt-document`** - For document-based custom analyzers (PDFs, images, Office docs)
  - **`prebuilt-audio`** - For audio-based custom analyzers
  - **`prebuilt-video`** - For video-based custom analyzers
  - **`prebuilt-image`** - For image-based custom analyzers

- **`fieldSchema`**: Defines the structured data to extract from content:
  - **`fields`**: Object defining each field to extract, with field names as keys
  - Each field definition includes:
    - **`type`**: Data type (`string`, `number`, `boolean`, `date`, `object`, `array`)
    - **`description`**: Clear explanation of the field - acts as a prompt to guide extraction accuracy
    - **`method`**: Extraction method to use:
      - **`"extract"`** - Extract values as they appear in content (literal text extraction). Requires `estimateFieldSourceAndConfidence: true`. Only supported for document analyzers.
      - **`"generate"`** - Generate values using AI based on content understanding (best for complex fields)
      - **`"classify"`** - Classify values against predefined categories (use with `enum`)
    - **`enum`**: (Optional) Fixed list of possible values for classification
    - **`items`**: (For arrays) Defines structure of array elements
    - **`properties`**: (For objects) Defines nested field structure

- **`config`**: Processing options that control analysis behavior:
  - **`returnDetails`**: Include confidence scores, bounding boxes, metadata (default: false)
  - **`enableOcr`**: Extract text from images/scans (default: true, document only)
  - **`enableLayout`**: Extract layout info like paragraphs, structure (default: true, document only)
  - **`estimateFieldSourceAndConfidence`**: Return source locations and confidence for extracted fields (document only)
  - **`locales`**: Language codes for transcription (audio/video, e.g., `["en-US"]`)
  - **`contentCategories`**: Define categories for classification and segmentation
  - **`enableSegment`**: Split content into categorized chunks (document/video)

- **`models`**: Specifies which AI models to use:
  - **`completion`**: Model for extraction/generation tasks (e.g., `"gpt-4.1"`, `"gpt-4.1-mini"`, the models configured earlier in this notebook)
  - **`embedding`**: Model for embedding tasks when using knowledge bases

For complete details, see the [Analyzer Reference Documentation](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/concepts/analyzer-reference).
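
To make the `classify` and `enum` options concrete, here is a minimal sketch of an analyzer definition with one classified field, plus a small local sanity check. The `DocumentType` field, its categories, and the `check_fields` helper are hypothetical; the checks only reflect the descriptions above and are not the service's validation rules.

```python
import json

# Hypothetical field definition: the name, description, and categories are
# made up to illustrate "classify" used together with "enum".
document_type_field = {
    "type": "string",
    "method": "classify",
    "description": "Kind of commercial document",
    "enum": ["invoice", "receipt", "purchase_order"],
}

sample_analyzer = {
    "baseAnalyzerId": "prebuilt-document",
    "description": "Illustrative analyzer with one classified field",
    "config": {"returnDetails": True},
    "fieldSchema": {
        "name": "SampleFields",
        "fields": {"DocumentType": document_type_field},
    },
    "models": {"completion": "gpt-4.1"},
}


def check_fields(fields: dict) -> list:
    """Collect simple consistency problems; not an exhaustive validator."""
    problems = []
    for name, spec in fields.items():
        if "type" not in spec:
            problems.append(f"{name}: missing 'type'")
        if spec.get("method") == "classify" and not spec.get("enum"):
            problems.append(f"{name}: 'classify' used without 'enum'")
        if spec.get("type") == "array" and "items" not in spec:
            problems.append(f"{name}: array without 'items'")
    return problems


print(check_fields(sample_analyzer["fieldSchema"]["fields"]))
print(json.dumps(sample_analyzer, indent=2))
```

A definition like this would be passed as `analyzer_template` to `client.begin_create_analyzer`, in the same way as the invoice analyzer later in this notebook.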


# Document Analysis

Let's start with document analysis by extracting fields from an invoice. This modality works well for processing structured documents and extracting key information such as amounts, dates, vendor details, and line items.

## 1. Invoice Field Extraction

Let's extract fields from an invoice PDF. This analyzer identifies essential invoice elements such as vendor information, amounts, dates, and line items.

**Create and Run Invoice Analyzer**

Now let's create the invoice analyzer and process our sample invoice:

In [None]:
import time
invoice_analyzer_id = f"notebooks_sample_invoice_extraction_{int(time.time())}"

invoice_analyzer = {
    "baseAnalyzerId": "prebuilt-document",
    "description": "Sample invoice analyzer that extracts vendor information, line items, and totals from commercial invoices",
    "config": {
        "returnDetails": True,
        "enableOcr": True,
        "enableLayout": True,
        "estimateFieldSourceAndConfidence": True
    },
    "fieldSchema": {
        "name": "InvoiceFields",
        "fields": {
            "VendorName": {
                "type": "string",
                "method": "extract",
                "description": "Name of the vendor or supplier, typically found in the header section"
            },
            "Items": {
                "type": "array",
                "method": "generate",
                "description": "List of items or services on the invoice, typically in a table format",
                "items": {
                    "type": "object",
                    "properties": {
                        "Description": {
                            "type": "string",
                            "description": "Item or service description"
                        },
                        "Amount": {
                            "type": "number",
                            "description": "Line total amount for this item"
                        }
                    }
                }
            }
        }
    },
    "models": {
        "completion": "gpt-4.1"
    }
}
print(f"{json.dumps(invoice_analyzer, indent=2)}")
# Start the analyzer creation operation
response = client.begin_create_analyzer(
    analyzer_id=invoice_analyzer_id,
    analyzer_template=invoice_analyzer,
)

# Wait for the analyzer to be created
print(f"⏳ Waiting for analyzer creation to complete...")
client.poll_result(response)
print(f"✅ Analyzer '{invoice_analyzer_id}' created successfully!")

Let's run the custom analyzer on an invoice PDF.

In [None]:
sample_file_path = '../data/invoice.pdf'

# Begin document analysis operation
print(f"🔍 Starting document analysis with analyzer '{invoice_analyzer_id}'...")
analysis_response = client.begin_analyze_binary(
    analyzer_id=invoice_analyzer_id,
    file_location=sample_file_path,
)

# Wait for analysis completion
print(f"⏳ Waiting for document analysis to complete...")
analysis_result = client.poll_result(analysis_response)
print(f"✅ Document analysis completed successfully!")


**Invoice Analysis Results**

Let's examine the extracted fields from the invoice:

In [None]:
# Display comprehensive results
if analysis_result and "result" in analysis_result:
    result = analysis_result["result"]
    contents = result.get("contents", [])
    
    if contents:
        first_content = contents[0]
        
        # Display extracted fields
        fields = first_content.get("fields", {})
        print("📊 Extracted Fields:")
        print("-" * 80)
        if fields:
            for field_name, field_value in fields.items():
                field_type = field_value.get("type")
                if field_type == "string":
                    print(f"{field_name}: {field_value.get('valueString')}")
                elif field_type == "number":
                    print(f"{field_name}: {field_value.get('valueNumber')}")
                elif field_type == "array":
                    print(f"{field_name} (array with {len(field_value.get('valueArray', []))} items):")
                    for idx, item in enumerate(field_value.get('valueArray', []), 1):
                        if item.get('type') == 'object':
                            print(f"  Item {idx}:")
                            for key, val in item.get('valueObject', {}).items():
                                if val.get('type') == 'string':
                                    print(f"    {key}: {val.get('valueString')}")
                                elif val.get('type') == 'number':
                                    print(f"    {key}: {val.get('valueNumber')}")
                elif field_type == "object":
                    print(f"{field_name}: {field_value.get('valueObject')}")
                print()
        else:
            print("No fields extracted")
        print()
        
        # Display content metadata
        print("📋 Content Metadata:")
        print("-" * 80)
        print(f"Kind: {first_content.get('kind')}")
        print(f"Pages: {first_content.get('startPageNumber')} - {first_content.get('endPageNumber')}")
        print(f"Unit: {first_content.get('unit')}")
        print()
        
    
    # Save full result to file
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output_file = f"test_output/invoice_analysis_result_{timestamp}.json"
    os.makedirs("test_output", exist_ok=True)
    
    with open(output_file, 'w') as f:
        json.dump(analysis_result, f, indent=2)
    
    print(f"💾 Full analysis result saved to: {output_file}")
else:
    print("No analysis result available")

**Clean Up Invoice Analyzer**

Clean up the analyzer to manage resources (in production, you would typically keep analyzers for reuse):

In [None]:
# Clean up the created analyzer
print(f"🗑️  Deleting analyzer '{invoice_analyzer_id}'...")
client.delete_analyzer(analyzer_id=invoice_analyzer_id)
print(f"✅ Analyzer '{invoice_analyzer_id}' deleted successfully!")
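
If an earlier cell fails, the cleanup cell may never run and the temporary analyzer is left behind. One way to guard against that, sketched here as an assumption rather than part of the sample client, is a context manager that always deletes the analyzer. The `temporary_analyzer` helper and the `FakeClient` stand-in are hypothetical; the three client methods are called with the same keyword arguments used in this notebook.

```python
from contextlib import contextmanager


@contextmanager
def temporary_analyzer(client, analyzer_id: str, template: dict):
    """Create an analyzer, yield its ID, and delete it even if analysis fails."""
    response = client.begin_create_analyzer(
        analyzer_id=analyzer_id,
        analyzer_template=template,
    )
    client.poll_result(response)
    try:
        yield analyzer_id
    finally:
        client.delete_analyzer(analyzer_id=analyzer_id)


class FakeClient:
    """Stand-in used only to exercise the helper without calling the service."""

    def __init__(self):
        self.calls = []

    def begin_create_analyzer(self, analyzer_id, analyzer_template):
        self.calls.append(("create", analyzer_id))
        return {"id": analyzer_id}

    def poll_result(self, response):
        self.calls.append(("poll", response["id"]))
        return response

    def delete_analyzer(self, analyzer_id):
        self.calls.append(("delete", analyzer_id))


fake = FakeClient()
try:
    with temporary_analyzer(fake, "demo_analyzer", {}) as analyzer_id:
        raise RuntimeError("simulated analysis failure")
except RuntimeError:
    pass
print(fake.calls)
```

With the real client, the body of the `with` block would hold the `begin_analyze_binary` and `poll_result` calls shown earlier.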

# Summary

🎉 **Congratulations!** You've successfully completed the field extraction tutorial for Azure AI Content Understanding!


## Next Steps

- **Try Other Notebooks**: 
  - `content_extraction.ipynb` - Multi-modal content extraction (audio, video, images)
  - `conversational_field_extraction.ipynb` - Extract fields from audio conversations
  - `management.ipynb` - Advanced analyzer management operations
- **Read the Documentation**: Visit the [Azure AI Content Understanding documentation](https://learn.microsoft.com/azure/ai-services/content-understanding/) for comprehensive guides and API references