# Azure AI Content Understanding - Classifier and Analyzer Demo

This notebook demonstrates how to use the Azure AI Content Understanding service to:
1. Create a classifier for document categorization
2. Create a custom analyzer to extract specific fields
3. Combine the classifier and analyzers to classify, optionally split, and analyze documents within a flexible processing pipeline

For more detailed information before getting started, please refer to the official documentation:
[Understanding Classifiers in Azure AI Services](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/concepts/classifier)

## Prerequisites
1. Ensure the Azure AI service is configured by following the [setup steps](../README.md#configure-azure-ai-service-resource).
2. Install the required packages to run this sample.

In [None]:
%pip install -r ../requirements.txt

## Create Azure AI Content Understanding Client

> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is a utility class that provides functions to interact with the Content Understanding API. Prior to the official release of the Content Understanding SDK, it serves as a lightweight SDK.
>
> Set **AZURE_AI_ENDPOINT** and, if you use key-based authentication, **AZURE_AI_API_KEY** in your `.env` file with the details from your Azure AI Service. Adjust the **API_VERSION** constant in the code if you need a different API version.

> ⚠️ Important:
> You must update the code below to use your preferred Azure authentication method.
> Look for the `# IMPORTANT` comments in the code and modify those sections accordingly.
> Skipping this step may cause the sample to not run correctly.

> ⚠️ Note: A subscription key is supported, but for production environments a token provider backed by Microsoft Entra ID (formerly Azure Active Directory, AAD) is strongly recommended for better security.
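The choice between the two authentication methods can be made explicit before any client is constructed. The following is a minimal sketch and not part of the sample's helper library: `resolve_auth_settings` is a hypothetical function that reads the same `AZURE_AI_ENDPOINT` and `AZURE_AI_API_KEY` variables as the cell below and fails early when the endpoint is missing.

```python
import os
from typing import Mapping, Optional


def resolve_auth_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return the endpoint and the auth mode implied by the environment.

    Hypothetical helper for illustration; the notebook itself does this inline.
    """
    env = os.environ if env is None else env
    endpoint = (env.get("AZURE_AI_ENDPOINT") or "").strip()
    if not endpoint:
        raise ValueError("AZURE_AI_ENDPOINT is not set; add it to your .env file")
    api_key = (env.get("AZURE_AI_API_KEY") or "").strip()
    return {
        "endpoint": endpoint,
        # A non-empty key selects key-based auth; otherwise fall back to a token provider.
        "auth_mode": "subscription_key" if api_key else "token_provider",
    }
```

Passing a plain dictionary instead of `os.environ` makes the decision easy to check without touching real credentials.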

In [None]:
import logging
import json
import os
import sys
import time
from dotenv import find_dotenv, load_dotenv
# Add the sibling 'python' directory to the Python path so the client and helper modules can be imported
sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'python'))
from content_understanding_client import AzureContentUnderstandingClient
from extension.sample_helper import save_json_to_file
from azure.core.credentials import AzureKeyCredential
from azure.identity import DefaultAzureCredential

load_dotenv(find_dotenv())
logging.basicConfig(level=logging.INFO)

# API Configuration
API_VERSION = "2025-11-01"  # GA version

# For authentication, you can use either token-based auth or subscription key; only one is required
AZURE_AI_ENDPOINT = os.getenv("AZURE_AI_ENDPOINT")
# IMPORTANT: Replace with your actual subscription key or set it in your ".env" file if not using token authentication
AZURE_AI_API_KEY = os.getenv("AZURE_AI_API_KEY")

# IMPORTANT: Choose your authentication method
# Option 1: Using Subscription Key (simpler but less secure)
if AZURE_AI_API_KEY:
    client = AzureContentUnderstandingClient(
        endpoint=AZURE_AI_ENDPOINT,
        api_version=API_VERSION,
        subscription_key=AZURE_AI_API_KEY
    )
    print("✅ AzureContentUnderstandingClient created with subscription key")
else:
    # Option 2: Using Azure AD Token Provider (recommended for production)
    credential = DefaultAzureCredential()
    
    # Create a token provider function that returns the access token
    def get_token():
        token = credential.get_token("https://cognitiveservices.azure.com/.default")
        return token.token
    
    client = AzureContentUnderstandingClient(
        endpoint=AZURE_AI_ENDPOINT,
        api_version=API_VERSION,
        token_provider=get_token
    )
    print("✅ AzureContentUnderstandingClient created with token provider")

## Configure Model Deployments for Prebuilt Analyzers

> **💡 Note:** This step is only required **once per Azure Content Understanding resource**, unless the GPT deployment has been changed. You can skip this section if:
> - This configuration has already been run once for your resource, or
> - Your administrator has already configured the model deployments for you

Before using prebuilt analyzers, you need to configure the default model deployment mappings. This tells Content Understanding which model deployments to use.

**Model Requirements:**
- **GPT-4.1** - Required for most prebuilt analyzers (e.g., `prebuilt-invoice`, `prebuilt-receipt`, `prebuilt-idDocument`)
- **GPT-4.1-mini** - Required for RAG analyzers (e.g., `prebuilt-documentSearch`, `prebuilt-audioSearch`, `prebuilt-videoSearch`)
- **text-embedding-3-large** - Required for all prebuilt analyzers that use embeddings

**Prerequisites:**
1. Deploy **GPT-4.1**, **GPT-4.1-mini**, and **text-embedding-3-large** models in Azure AI Foundry
2. Set `GPT_4_1_DEPLOYMENT`, `GPT_4_1_MINI_DEPLOYMENT`, and `TEXT_EMBEDDING_3_LARGE_DEPLOYMENT` in your `.env` file with the deployment names
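The relationship between model names and environment variables listed above can also be written as a single lookup table. The sketch below is a compact, table-driven variant of the check performed in the next cell; `REQUIRED_DEPLOYMENTS` and `collect_deployments` are hypothetical names introduced only for this illustration.

```python
import os
from typing import Mapping, Optional

# Model name -> environment variable holding the deployment name (as listed above).
REQUIRED_DEPLOYMENTS = {
    "gpt-4.1": "GPT_4_1_DEPLOYMENT",
    "gpt-4.1-mini": "GPT_4_1_MINI_DEPLOYMENT",
    "text-embedding-3-large": "TEXT_EMBEDDING_3_LARGE_DEPLOYMENT",
}


def collect_deployments(env: Optional[Mapping[str, str]] = None):
    """Return (mapping, missing): model -> deployment name, plus unset variable names."""
    env = os.environ if env is None else env
    mapping, missing = {}, []
    for model, var_name in REQUIRED_DEPLOYMENTS.items():
        value = (env.get(var_name) or "").strip()
        if value:
            mapping[model] = value
        else:
            missing.append(var_name)
    return mapping, missing
```

When `missing` is empty, `mapping` has the same shape as the dictionary the notebook passes to `client.update_defaults`.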

In [None]:
# Get model deployment names from environment variables
GPT_4_1_DEPLOYMENT = os.getenv("GPT_4_1_DEPLOYMENT")
GPT_4_1_MINI_DEPLOYMENT = os.getenv("GPT_4_1_MINI_DEPLOYMENT")
TEXT_EMBEDDING_3_LARGE_DEPLOYMENT = os.getenv("TEXT_EMBEDDING_3_LARGE_DEPLOYMENT")

# Check if required deployments are configured
missing_deployments = []
if not GPT_4_1_DEPLOYMENT:
    missing_deployments.append("GPT_4_1_DEPLOYMENT")
if not GPT_4_1_MINI_DEPLOYMENT:
    missing_deployments.append("GPT_4_1_MINI_DEPLOYMENT")
if not TEXT_EMBEDDING_3_LARGE_DEPLOYMENT:
    missing_deployments.append("TEXT_EMBEDDING_3_LARGE_DEPLOYMENT")

if missing_deployments:
    print(f"⚠️  Warning: Missing required model deployment configuration(s):")
    for deployment in missing_deployments:
        print(f"   - {deployment}")
    print("\n   Prebuilt analyzers require GPT-4.1, GPT-4.1-mini, and text-embedding-3-large deployments.")
    print("   Please:")
    print("   1. Deploy all three models in Azure AI Foundry")
    print("   2. Add the following to notebooks/.env:")
    print("      GPT_4_1_DEPLOYMENT=<your-gpt-4.1-deployment-name>")
    print("      GPT_4_1_MINI_DEPLOYMENT=<your-gpt-4.1-mini-deployment-name>")
    print("      TEXT_EMBEDDING_3_LARGE_DEPLOYMENT=<your-text-embedding-3-large-deployment-name>")
    print("   3. Restart the kernel and run this cell again")
else:
    print(f"📋 Configuring default model deployments...")
    print(f"   GPT-4.1 deployment: {GPT_4_1_DEPLOYMENT}")
    print(f"   GPT-4.1-mini deployment: {GPT_4_1_MINI_DEPLOYMENT}")
    print(f"   text-embedding-3-large deployment: {TEXT_EMBEDDING_3_LARGE_DEPLOYMENT}")
    
    try:
        # Update defaults to map model names to your deployments
        result = client.update_defaults({
            "gpt-4.1": GPT_4_1_DEPLOYMENT,
            "gpt-4.1-mini": GPT_4_1_MINI_DEPLOYMENT,
            "text-embedding-3-large": TEXT_EMBEDDING_3_LARGE_DEPLOYMENT
        })
        
        print(f"✅ Default model deployments configured successfully")
        print(f"   Model mappings:")
        for model, deployment in result.get("modelDeployments", {}).items():
            print(f"     {model} → {deployment}")
    except Exception as e:
        print(f"❌ Failed to configure defaults: {e}")
        print(f"   This may happen if:")
        print(f"   - One or more deployment names don't exist in your Azure AI Foundry project")
        print(f"   - You don't have permission to update defaults")
        raise


## Create a Basic Classifier
Create a custom classifier and use it to classify a local PDF through the `begin_analyze_binary` API.

High-level steps:
1. Create a custom classifier
2. Classify a document from a local file
3. Save the classification result to a file
4. Clean up the created classifier
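Because the classifier relies on the category descriptions to tell document types apart, it can help to check the definition locally before submitting it. The sketch below is a hypothetical pre-flight check, not a reproduction of the service's own validation rules: `check_categories` only confirms that categories exist and that each has a non-empty description.

```python
def check_categories(analyzer_definition: dict) -> list:
    """Return a list of human-readable problems found in the category definitions."""
    config = analyzer_definition.get("config", {})
    categories = config.get("contentCategories", {})
    problems = []
    if not categories:
        problems.append("no contentCategories defined")
    for name, spec in categories.items():
        if not (spec.get("description") or "").strip():
            problems.append(f"category '{name}' has no description")
    return problems


# Hypothetical definition with one incomplete category.
sample_definition = {
    "config": {
        "contentCategories": {
            "Loan application": {"description": "Requests for funding."},
            "Invoice": {"description": ""},
        }
    }
}
print(check_categories(sample_definition))
```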

In [None]:
# Generate a unique classifier ID
analyzer_id = f"notebooks_sample_classifier_{int(time.time())}"

# Define the classifier as a dictionary
content_analyzer = {
    "baseAnalyzerId": "prebuilt-document",
    "description": f"Custom classifier for URL classification demo: {analyzer_id}",
    "config": {
        "returnDetails": True,
        "enableSegment": True,
        "contentCategories": {
            "Loan application": {
                "description": "Documents submitted by individuals or businesses to request funding, typically including personal or business details, financial history, loan amount, purpose, and supporting documentation."
            },
            "Invoice": {
                "description": "Billing documents issued by sellers or service providers to request payment for goods or services, detailing items, prices, taxes, totals, and payment terms."
            },
            "Bank_Statement": {
                "description": "Official statements issued by banks that summarize account activity over a period, including deposits, withdrawals, fees, and balances."
            }
        }
    },
    "models": {"completion": "gpt-4.1"},
    "tags": {"demo_type": "url_classification"}
}

# Create a custom classifier
print(f"🔧 Creating custom classifier '{analyzer_id}'...")

# Start the classifier creation operation
response = client.begin_create_analyzer(
    analyzer_id=analyzer_id,
    analyzer_template=content_analyzer,
)

# Wait for the classifier to be created
print(f"⏳ Waiting for classifier creation to complete...")
client.poll_result(response)
print(f"✅ Classifier '{analyzer_id}' created successfully!")

## Classify Your Document

Now, use the classifier to categorize your document.

In [None]:
# Read the mixed financial docs PDF file
pdf_path = "../data/mixed_financial_docs.pdf"
print(f"📄 Reading document file: {pdf_path}")

# Begin binary classification operation
print(f"🔍 Starting binary classification with classifier '{analyzer_id}'...")
analysis_response = client.begin_analyze_binary(
    analyzer_id=analyzer_id,
    file_location=pdf_path,
)

# Wait for analysis completion
print(f"⏳ Waiting for document analysis to complete...")
analysis_result = client.poll_result(analysis_response)
print(f"✅ Document analysis completed successfully!")

## View Classification Results

Review the classification results generated for your document.
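For a multi-document PDF it is often easier to read the result grouped by category than segment by segment. The sketch below assumes the result shape that the next cell reads (`result` → `contents[0]` → `segments`, each with `category`, `startPageNumber`, and `endPageNumber`); `pages_by_category` and the sample payload are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def pages_by_category(analysis_result: dict) -> dict:
    """Map each category to the list of (start_page, end_page) ranges assigned to it."""
    contents = analysis_result.get("result", {}).get("contents", [])
    if not contents:
        return {}
    ranges = defaultdict(list)
    for segment in contents[0].get("segments", []):
        category = segment.get("category", "N/A")
        ranges[category].append(
            (segment.get("startPageNumber"), segment.get("endPageNumber"))
        )
    return dict(ranges)


# Hypothetical payload using only the keys the notebook reads.
sample_result = {"result": {"contents": [{"segments": [
    {"category": "Invoice", "startPageNumber": 1, "endPageNumber": 1},
    {"category": "Bank_Statement", "startPageNumber": 2, "endPageNumber": 3},
    {"category": "Invoice", "startPageNumber": 4, "endPageNumber": 4},
]}]}}
print(pages_by_category(sample_result))
```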

In [None]:
# Display results
if analysis_result and "result" in analysis_result:
    result = analysis_result["result"]
    contents = result.get("contents", [])
    
    if contents:
        first_content = contents[0]
        
        # Display classification results from segments
        segments = first_content.get("segments", [])
        if segments:
            print("\n📊 Classification Results:")
            print("=" * 50)
            for idx, segment in enumerate(segments, 1):
                print(f"\nSegment {idx}:")
                print(f"   Category: {segment.get('category', 'N/A')}")
                print(f"   Start Page: {segment.get('startPageNumber', 'N/A')}")
                print(f"   End Page: {segment.get('endPageNumber', 'N/A')}")
                print(f"   Segment ID: {segment.get('segmentId', 'N/A')}")
            print("=" * 50)
    else:
        print("No contents available in analysis result")
else:
    print("No analysis result available")

## Saving Classification Results
The classification result is saved to a JSON file for later analysis.
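The internals of `save_json_to_file` are not shown in this notebook. As a rough idea of what such a helper might do, the sketch below writes a dictionary to a timestamped JSON file; `save_json_with_timestamp`, its parameters, and the file naming scheme are assumptions for illustration and may differ from the real helper.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


def save_json_with_timestamp(data, filename_prefix: str, output_dir: str) -> Path:
    """Write data as indented JSON to <output_dir>/<prefix>_<timestamp>.json."""
    target_dir = Path(output_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = target_dir / f"{filename_prefix}_{timestamp}.json"
    with path.open("w", encoding="utf-8") as handle:
        json.dump(data, handle, indent=2, ensure_ascii=False)
    return path


# Write a placeholder payload to a temporary directory.
demo_dir = tempfile.mkdtemp()
saved = save_json_with_timestamp({"example": True}, "classification_get_result", demo_dir)
print(f"Saved to {saved}")
```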

In [None]:
# Save the analysis result to a file
saved_file_path = save_json_to_file(analysis_result, filename_prefix="classification_get_result")
# Print the full analysis result as a JSON string
print(json.dumps(analysis_result, indent=2))

## Clean up the created analyzer
Once the demo is complete, delete the classifier to prevent resource accumulation.

In [None]:
# Clean up the created classifier
print(f"🗑️  Deleting classifier '{analyzer_id}'...")
client.delete_analyzer(analyzer_id=analyzer_id)
print(f"✅ Classifier '{analyzer_id}' deleted successfully!")

## Create a Custom Analyzer (Advanced)

Create a custom analyzer to extract specific fields from documents.
This example extracts common fields from loan application documents and generates a brief summary of each application.
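Every field in the analyzer defined below follows the same pattern: a type, the `generate` method, and a description. The following sketch shows one way to build that structure from a compact specification; `build_field_schema` is a hypothetical convenience function, not part of the Content Understanding client.

```python
def build_field_schema(spec: dict) -> dict:
    """Expand {name: (type, description)} into a fieldSchema dictionary.

    Every field uses the "generate" method, matching the analyzer in this notebook.
    """
    return {
        "fields": {
            name: {"type": field_type, "method": "generate", "description": description}
            for name, (field_type, description) in spec.items()
        }
    }


loan_fields = build_field_schema({
    "ApplicantName": ("string", "Full name of the loan applicant or company."),
    "LoanAmountRequested": ("number", "The total loan amount requested by the applicant."),
})
print(loan_fields)
```

The resulting dictionary could be assigned to the `fieldSchema` key of an analyzer definition.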

In [None]:
# Generate a unique analyzer ID for loan applications
loan_analyzer_id = f"notebooks_sample_loan_analyzer_{int(time.time())}"

# Define custom analyzer as a dictionary
custom_analyzer = {
    "baseAnalyzerId": "prebuilt-document",
    "description": "Loan application analyzer - extracts key information from loan applications",
    "config": {
        "returnDetails": True,
        "enableLayout": True,
        "enableFormula": False,
        "estimateFieldSourceAndConfidence": True
    },
    "fieldSchema": {
        "fields": {
            "ApplicationDate": {
                "type": "date",
                "method": "generate",
                "description": "The date when the loan application was submitted."
            },
            "ApplicantName": {
                "type": "string",
                "method": "generate",
                "description": "Full name of the loan applicant or company."
            },
            "LoanAmountRequested": {
                "type": "number",
                "method": "generate",
                "description": "The total loan amount requested by the applicant."
            },
            "LoanPurpose": {
                "type": "string",
                "method": "generate",
                "description": "The stated purpose or reason for the loan."
            },
            "CreditScore": {
                "type": "number",
                "method": "generate",
                "description": "Credit score of the applicant, if available."
            },
            "Summary": {
                "type": "string",
                "method": "generate",
                "description": "A brief summary overview of the loan application details."
            }
        }
    },
    "models": {"completion": "gpt-4.1"},
    "tags": {"demo": "loan-application"}
}

# Create the custom analyzer
print(f"🔧 Creating custom analyzer '{loan_analyzer_id}'...")
response = client.begin_create_analyzer(
    analyzer_id=loan_analyzer_id,
    analyzer_template=custom_analyzer,
)
client.poll_result(response)
print(f"✅ Analyzer '{loan_analyzer_id}' created successfully!")

## Create an Enhanced Classifier with Custom Analyzer

Now create a new classifier that routes loan application documents to the custom analyzer through the category's `analyzerId` property. The invoice and bank statement categories are defined without an `analyzerId` in this example.
This combines document classification with field extraction in one operation.
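To see at a glance which categories have an analyzer attached, the category definitions can be reduced to a simple routing table. This is a minimal sketch that only inspects the dictionary layout used in this notebook; `category_routing` and the example definition are hypothetical.

```python
def category_routing(analyzer_definition: dict) -> dict:
    """Map each category name to its analyzerId, or None when no analyzer is attached."""
    categories = analyzer_definition.get("config", {}).get("contentCategories", {})
    return {name: spec.get("analyzerId") for name, spec in categories.items()}


# Hypothetical definition: one category with an analyzer, one without.
example_definition = {
    "config": {
        "contentCategories": {
            "Loan application": {
                "description": "Requests for funding.",
                "analyzerId": "my_loan_analyzer",
            },
            "Invoice": {"description": "Requests for payment."},
        }
    }
}
print(category_routing(example_definition))
```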

In [None]:
# Generate a unique enhanced classifier ID
enhanced_classifier_id = f"notebooks_sample_enhanced_classifier_{int(time.time())}"

# Define enhanced classifier with custom analyzer for loan applications
enhanced_analyzer = {
    "baseAnalyzerId": "prebuilt-document",
    "description": f"Enhanced classifier with custom loan analyzer: {enhanced_classifier_id}",
    "config": {
        "returnDetails": True,
        "enableSegment": True,
        "contentCategories": {
            "Loan application": {
                "description": "Documents submitted by individuals or businesses to request funding, typically including personal or business details, financial history, loan amount, purpose, and supporting documentation.",
                "analyzerId": loan_analyzer_id  # Use the custom loan analyzer
            },
            "Invoice": {
                "description": "Billing documents issued by sellers or service providers to request payment for goods or services, detailing items, prices, taxes, totals, and payment terms."
            },
            "Bank_Statement": {
                "description": "Official statements issued by banks that summarize account activity over a period, including deposits, withdrawals, fees, and balances."
            }
        }
    },
    "models": {"completion": "gpt-4.1"},
    "tags": {"demo_type": "enhanced_classification"}
}

# Create the enhanced classifier
print(f"🔧 Creating enhanced classifier '{enhanced_classifier_id}'...")
response = client.begin_create_analyzer(
    analyzer_id=enhanced_classifier_id,
    analyzer_template=enhanced_analyzer,
)

# Wait for the classifier to be created
print(f"⏳ Waiting for classifier creation to complete...")
client.poll_result(response)
print(f"✅ Enhanced classifier '{enhanced_classifier_id}' created successfully!")

## Process Document with Enhanced Classifier

Process the document again using the enhanced classifier.
Loan application segments will now have additional fields extracted by the custom analyzer.

In [None]:
pdf_path = "../data/mixed_financial_docs.pdf"
print(f"📄 Reading document file: {pdf_path}")

# Begin binary classification operation with enhanced classifier
print(f"🔍 Starting binary classification with enhanced classifier '{enhanced_classifier_id}'...")
enhanced_analysis_response = client.begin_analyze_binary(
    analyzer_id=enhanced_classifier_id,
    file_location=pdf_path,
)

# Wait for classification completion
print(f"⏳ Waiting for classification to complete...")
enhanced_analysis_result = client.poll_result(enhanced_analysis_response)
print(f"✅ Classification completed successfully!")

## View Enhanced Results with Extracted Fields

Review the classification results alongside extracted fields from loan application documents.
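Each extracted field carries its value under a type-specific key (`valueString`, `valueNumber`, or `valueDate` in the cell below). The sketch below flattens such a structure into plain name/value pairs using a lookup table instead of an `if`/`elif` chain. `flatten_fields` and the sample data are hypothetical, and only the three field types handled in this notebook are covered.

```python
# Field type -> key that holds the value, for the types used in this notebook.
VALUE_KEYS = {"string": "valueString", "number": "valueNumber", "date": "valueDate"}


def flatten_fields(fields: dict) -> dict:
    """Return {field_name: value}; unknown types and missing values become None."""
    flat = {}
    for name, field in fields.items():
        key = VALUE_KEYS.get(field.get("type"))
        flat[name] = field.get(key) if key else None
    return flat


# Hypothetical extraction output for illustration.
sample_fields = {
    "ApplicantName": {"type": "string", "valueString": "Example Applicant"},
    "LoanAmountRequested": {"type": "number", "valueNumber": 25000},
    "ApplicationDate": {"type": "date"},  # value not found in the document
    "Other": {"type": "object"},  # type not handled by this sketch
}
print(flatten_fields(sample_fields))
```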

In [None]:
# Display enhanced classification results
if enhanced_analysis_result and "result" in enhanced_analysis_result:
    result = enhanced_analysis_result["result"]
    contents = result.get("contents", [])
    
    if contents:
        print("\n📊 Enhanced Classification Results with Field Extraction:")
        print("=" * 80)
        
        for idx, content_item in enumerate(contents, 1):
            print(f"\n🔖 Segment {idx}:")
            print(f"   Category: {content_item.get('category', 'N/A')}")
            print(f"   Pages: {content_item.get('startPageNumber', 'N/A')} - {content_item.get('endPageNumber', 'N/A')}")
            
            # Display extracted fields if available
            fields = content_item.get("fields", {})
            if fields:
                print(f"\n   📋 Extracted Fields:")
                for field_name, field_value in fields.items():
                    field_type = field_value.get("type")
                    if field_type == "string":
                        print(f"      • {field_name}: {field_value.get('valueString')}")
                    elif field_type == "number":
                        print(f"      • {field_name}: {field_value.get('valueNumber')}")
                    elif field_type == "date":
                        print(f"      • {field_name}: {field_value.get('valueDate')}")
            else:
                print(f"   (No custom fields extracted for this category)")
        
        
        print("\n" + "=" * 80)
        
        # Display document information for the first segment
        first_content = contents[0]
        if first_content.get("kind") == "document":
            print(f"\n📚 Document Information:")
            pages = first_content.get("pages")
            if pages:
                print(f"Total pages in document: {len(pages)}")
                unit = first_content.get("unit", "units")
                print(f"Page dimensions: {pages[0].get('width')} x {pages[0].get('height')} {unit}")
    else:
        print("No contents available in enhanced analysis result")
else:
    print("No enhanced analysis result available")

## Saving Classification Results
The classification result is saved to a JSON file for later analysis.

In [None]:
# Save the enhanced analysis result to a file
saved_file_path = save_json_to_file(enhanced_analysis_result, filename_prefix="enhanced_classification_get_result")
# Print the full analysis result as a JSON string
print(json.dumps(enhanced_analysis_result, indent=2))

## Clean up the created analyzer
Once the demo is complete, delete the custom loan analyzer to prevent resource accumulation.

In [None]:
# Clean up the custom loan analyzer
print(f"🗑️  Deleting analyzer '{loan_analyzer_id}'...")
client.delete_analyzer(analyzer_id=loan_analyzer_id)
print(f"✅ Analyzer '{loan_analyzer_id}' deleted successfully!")

## Clean up the created classifier
Finally, delete the enhanced classifier to prevent resource accumulation.
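The cleanup cells in this notebook run only if you execute them, so an error in an earlier cell can leave analyzers behind. Outside a notebook, one option is to tie deletion to a context manager so that it runs even when processing fails. The sketch below assumes only that the client exposes `delete_analyzer(analyzer_id=...)`, as used in this notebook; `temporary_analyzer` and `FakeClient` are hypothetical names, and the fake client stands in for the real one so the example runs without Azure access.

```python
from contextlib import contextmanager


@contextmanager
def temporary_analyzer(client, analyzer_id: str):
    """Yield analyzer_id and delete the analyzer on exit, even if the body raises."""
    try:
        yield analyzer_id
    finally:
        client.delete_analyzer(analyzer_id=analyzer_id)


class FakeClient:
    """Stand-in for the real client; records deletions instead of calling Azure."""

    def __init__(self):
        self.deleted = []

    def delete_analyzer(self, analyzer_id: str) -> None:
        self.deleted.append(analyzer_id)


fake = FakeClient()
try:
    with temporary_analyzer(fake, "demo_classifier"):
        raise RuntimeError("simulated failure during analysis")
except RuntimeError:
    pass
print(fake.deleted)
```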

In [None]:
# Clean up the enhanced classifier
print(f"üóëÔ∏è  Deleting classifier '{enhanced_classifier_id}'...")
client.delete_analyzer(analyzer_id=enhanced_classifier_id)
print(f"‚úÖ Classifier '{enhanced_classifier_id}' deleted successfully!")