# Enhance Your Analyzer with Labeled Data


> **Note:** Currently, this feature is only available when the analyzer scenario is set to `document`.

Labeled data consists of samples that have been tagged with one or more labels to add context or meaning. This additional information is used to improve the analyzer's performance.

In your own projects, you can use [Azure AI Foundry](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/quickstart/use-ai-foundry) to annotate your data with the labeling tool.

This notebook demonstrates how to create an analyzer using your labeled data and how to analyze your files afterward.


## Prerequisites
1. Ensure your Azure AI service is configured by following the [configuration steps](../README.md#configure-azure-ai-service-resource).
2. Set environment variables related to training data by following the steps in [Set env for training data](../docs/set_env_for_training_data_and_reference_doc.md) and adding them to the [.env](./.env) file.
   - You can either set `TRAINING_DATA_SAS_URL` directly with the SAS URL for your Azure Blob container,
   - Or set both `TRAINING_DATA_STORAGE_ACCOUNT_NAME` and `TRAINING_DATA_CONTAINER_NAME` to generate the SAS URL automatically during later steps.
   - Also set `TRAINING_DATA_PATH` to specify the folder path within the container where the training data will be uploaded.
3. Install the packages required to run the sample:


In [None]:
%pip install -r ../requirements.txt

## Analyzer Template and Local Training Folder Setup
In this sample, we define a template for receipts.

The training folder should contain a flat (one-level) directory of labeled receipt documents. Each document includes:
- The original file (e.g., PDF or image).
- A corresponding `.labels.json` file with labeled fields.
- A corresponding `.result.json` file with OCR results.
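
Before uploading, it can help to confirm that every document has both companion files. The following is a minimal sketch, not part of the sample's helper library. It assumes the companion files are named by appending `.labels.json` and `.result.json` to the full document file name (for example `receipt.jpg.labels.json`), and `find_incomplete_documents` is a hypothetical helper name.

```python
from pathlib import Path

# Suffixes assumed from the file list above; adjust them if your labeling tool differs.
LABEL_SUFFIX = ".labels.json"
RESULT_SUFFIX = ".result.json"


def find_incomplete_documents(folder: str) -> dict[str, list[str]]:
    """Map each document file name to the companion files it is missing."""
    root = Path(folder)
    names = {p.name for p in root.iterdir() if p.is_file()}
    # Any file that is not itself a companion file is treated as a document.
    documents = sorted(
        n for n in names if not n.endswith((LABEL_SUFFIX, RESULT_SUFFIX))
    )
    missing: dict[str, list[str]] = {}
    for doc in documents:
        absent = [
            doc + suffix
            for suffix in (LABEL_SUFFIX, RESULT_SUFFIX)
            if doc + suffix not in names
        ]
        if absent:
            missing[doc] = absent
    return missing
```

Called on the training folder path, an empty dictionary means every document has both companion files. Note that stray files such as `.DS_Store` would be reported as documents with missing companions.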

In [None]:
training_docs_folder = "../data/document_training"

## Create Azure Content Understanding Client
> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is a utility class that contains helper functions. Until the official Content Understanding SDK is released, treat it as a lightweight SDK.
>
> Fill in the constants **AZURE_AI_ENDPOINT**, **API_VERSION**, and **AZURE_AI_API_KEY** with the information from your Azure AI Service.

> ⚠️ Important:
> You must update the code below to match your Azure authentication method.
> Look for the `# IMPORTANT` comments and modify those sections accordingly.
> If you skip this step, the sample may not run correctly.

> ⚠️ Note: While using a subscription key works, using a token provider with Azure Active Directory (AAD) is safer and highly recommended for production environments.
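
The "only one method is required" rule can be made explicit with a small helper. This is a sketch under assumptions: `select_credentials` is a hypothetical name, and the returned keys mirror the `subscription_key` and `token_provider` keyword arguments that the client constructor receives in the cell below.

```python
import os
from typing import Callable, Mapping, Optional


def select_credentials(
    token_provider: Callable[[], str],
    env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Return keyword arguments that enable exactly one authentication method."""
    env = os.environ if env is None else env
    api_key = env.get("AZURE_AI_API_KEY")
    if api_key:
        # A subscription key is present, so the token provider is left unused.
        return {"subscription_key": api_key, "token_provider": None}
    # No key configured: fall back to token-based authentication.
    return {"subscription_key": None, "token_provider": token_provider}
```

Keeping the decision in one place makes it harder to pass both credentials by accident, and the `env` parameter lets you check the logic without touching real environment variables.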

In [None]:
from datetime import datetime
import logging
import json
import os
import sys
import time
import uuid
from typing import Any, Optional
from dotenv import find_dotenv, load_dotenv
from azure.storage.blob import ContainerSasPermissions
# Add the parent directory to the Python path to import the sample_helper module
sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'python'))
from content_understanding_client import AzureContentUnderstandingClient
from extension.document_processor import DocumentProcessor
from extension.sample_helper import save_json_to_file 
from azure.identity import DefaultAzureCredential

load_dotenv(find_dotenv())
logging.basicConfig(level=logging.INFO)

# For authentication, you can use either token-based auth or subscription key; only one is required
AZURE_AI_ENDPOINT = os.getenv("AZURE_AI_ENDPOINT")
# IMPORTANT: Replace with your actual subscription key or set it in your ".env" file if not using token authentication
AZURE_AI_API_KEY = os.getenv("AZURE_AI_API_KEY")
API_VERSION = "2025-11-01"

# Create token provider for Azure AD authentication
def token_provider():
    credential = DefaultAzureCredential()
    token = credential.get_token("https://cognitiveservices.azure.com/.default")
    return token.token

# Create the Content Understanding client
try:
    client = AzureContentUnderstandingClient(
        endpoint=AZURE_AI_ENDPOINT,
        api_version=API_VERSION,
        subscription_key=AZURE_AI_API_KEY,
        token_provider=token_provider if not AZURE_AI_API_KEY else None,
        # The user agent is used for tracking sample usage and does not provide identity information.
        # You can change this if you want to opt out of tracking.
        x_ms_useragent="azure-ai-content-understanding-python-sample-ga",
    )
    credential_type = "Subscription Key" if AZURE_AI_API_KEY else "Azure AD Token"
    print("✅ Client created successfully")
    print(f"   Endpoint: {AZURE_AI_ENDPOINT}")
    print(f"   Credential: {credential_type}")
    print(f"   API Version: {API_VERSION}")
except Exception as e:
    credential_type = "Subscription Key" if AZURE_AI_API_KEY else "Azure AD Token"
    print("❌ Failed to create client")
    print(f"   Endpoint: {AZURE_AI_ENDPOINT}")
    print(f"   Credential: {credential_type}")
    print(f"   Error: {e}")
    raise

try:
    processor = DocumentProcessor(client)
    print("✅ DocumentProcessor created successfully")
except Exception as e:
    print(f"❌ Failed to create DocumentProcessor: {e}")
    raise

## Configure Model Deployments for Prebuilt Analyzers

> **💡 Note:** This step is only required **once per Azure Content Understanding resource**, unless the GPT deployment has been changed. You can skip this section if:
> - This configuration has already been run once for your resource, or
> - Your administrator has already configured the model deployments for you

Before using prebuilt analyzers, you need to configure the default model deployment mappings. This tells Content Understanding which model deployments to use.

**Model Requirements:**
- **GPT-4.1** - Required for most prebuilt analyzers (e.g., `prebuilt-invoice`, `prebuilt-receipt`, `prebuilt-idDocument`)
- **GPT-4.1-mini** - Required for RAG analyzers (e.g., `prebuilt-documentSearch`, `prebuilt-audioSearch`, `prebuilt-videoSearch`)
- **text-embedding-3-large** - Required for all prebuilt analyzers that use embeddings

**Prerequisites:**
1. Deploy **GPT-4.1**, **GPT-4.1-mini**, and **text-embedding-3-large** models in Azure AI Foundry
2. Set `GPT_4_1_DEPLOYMENT`, `GPT_4_1_MINI_DEPLOYMENT`, and `TEXT_EMBEDDING_3_LARGE_DEPLOYMENT` in your `.env` file with the deployment names

In [None]:
# Get model deployment names from environment variables
GPT_4_1_DEPLOYMENT = os.getenv("GPT_4_1_DEPLOYMENT")
GPT_4_1_MINI_DEPLOYMENT = os.getenv("GPT_4_1_MINI_DEPLOYMENT")
TEXT_EMBEDDING_3_LARGE_DEPLOYMENT = os.getenv("TEXT_EMBEDDING_3_LARGE_DEPLOYMENT")

# Check if required deployments are configured
missing_deployments = []
if not GPT_4_1_DEPLOYMENT:
    missing_deployments.append("GPT_4_1_DEPLOYMENT")
if not GPT_4_1_MINI_DEPLOYMENT:
    missing_deployments.append("GPT_4_1_MINI_DEPLOYMENT")
if not TEXT_EMBEDDING_3_LARGE_DEPLOYMENT:
    missing_deployments.append("TEXT_EMBEDDING_3_LARGE_DEPLOYMENT")

if missing_deployments:
    print("⚠️  Warning: Missing required model deployment configuration(s):")
    for deployment in missing_deployments:
        print(f"   - {deployment}")
    print("\n   Prebuilt analyzers require GPT-4.1, GPT-4.1-mini, and text-embedding-3-large deployments.")
    print("   Please:")
    print("   1. Deploy all three models in Azure AI Foundry")
    print("   2. Add the following to notebooks/.env:")
    print("      GPT_4_1_DEPLOYMENT=<your-gpt-4.1-deployment-name>")
    print("      GPT_4_1_MINI_DEPLOYMENT=<your-gpt-4.1-mini-deployment-name>")
    print("      TEXT_EMBEDDING_3_LARGE_DEPLOYMENT=<your-text-embedding-3-large-deployment-name>")
    print("   3. Restart the kernel and run this cell again")
else:
    print("📋 Configuring default model deployments...")
    print(f"   GPT-4.1 deployment: {GPT_4_1_DEPLOYMENT}")
    print(f"   GPT-4.1-mini deployment: {GPT_4_1_MINI_DEPLOYMENT}")
    print(f"   text-embedding-3-large deployment: {TEXT_EMBEDDING_3_LARGE_DEPLOYMENT}")
    
    try:
        # Update defaults to map model names to your deployments
        result = client.update_defaults({
            "gpt-4.1": GPT_4_1_DEPLOYMENT,
            "gpt-4.1-mini": GPT_4_1_MINI_DEPLOYMENT,
            "text-embedding-3-large": TEXT_EMBEDDING_3_LARGE_DEPLOYMENT
        })
        
        print("✅ Default model deployments configured successfully")
        print("   Model mappings:")
        for model, deployment in result.get("modelDeployments", {}).items():
            print(f"     {model} → {deployment}")
    except Exception as e:
        print(f"❌ Failed to configure defaults: {e}")
        print(f"   This may happen if:")
        print(f"   - One or more deployment names don't exist in your Azure AI Foundry project")
        print(f"   - You don't have permission to update defaults")
        raise


## Prepare Labeled Data
In this step, we will:
- Use `TRAINING_DATA_PATH` and the SAS URL-related environment variables set in the Prerequisites step.
- Attempt to get the SAS URL from the environment variable `TRAINING_DATA_SAS_URL`.
- If `TRAINING_DATA_SAS_URL` is not set, try generating it automatically using `TRAINING_DATA_STORAGE_ACCOUNT_NAME` and `TRAINING_DATA_CONTAINER_NAME` environment variables.
- Verify that each document file in the local folder has corresponding `.labels.json` and `.result.json` files.
- Upload these files to the Azure Blob storage container specified by the environment variables.
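
The order in which the SAS URL is resolved can be sketched as a pure function over the environment variables. This is an illustration only: `plan_sas_url` and `normalize_prefix` are hypothetical names, and actually generating a SAS URL still requires the storage helper used in the cell below.

```python
from typing import Mapping, Optional, Tuple


def plan_sas_url(env: Mapping[str, str]) -> Tuple[str, Optional[str]]:
    """Decide how the container SAS URL will be obtained.

    Returns ("provided", url), ("generate", "<account>/<container>"),
    or ("missing", None) when neither option is configured.
    """
    sas_url = env.get("TRAINING_DATA_SAS_URL")
    if sas_url:
        return ("provided", sas_url)
    account = env.get("TRAINING_DATA_STORAGE_ACCOUNT_NAME")
    container = env.get("TRAINING_DATA_CONTAINER_NAME")
    if account and container:
        return ("generate", f"{account}/{container}")
    return ("missing", None)


def normalize_prefix(path: str) -> str:
    """Ensure the blob folder path ends with exactly one trailing slash."""
    return path.rstrip("/") + "/"
```

A directly supplied `TRAINING_DATA_SAS_URL` always wins. The account and container names are consulted only when it is absent, and both must be set for generation to be possible.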

In [None]:
# Load training data storage configuration from environment
training_data_path = os.getenv("TRAINING_DATA_PATH") or f"training_data_{uuid.uuid4().hex[:8]}"
training_data_sas_url = os.getenv("TRAINING_DATA_SAS_URL")

print("📋 Configuration:")
print(f"   Training Data Path: {training_data_path}")
print(f"   Training Data SAS URL: {'<set>' if training_data_sas_url else '<not set>'}")

if not training_data_path.endswith("/"):
    training_data_path += "/"

if not training_data_sas_url:
    training_data_storage_account_name = os.getenv("TRAINING_DATA_STORAGE_ACCOUNT_NAME")
    training_data_container_name = os.getenv("TRAINING_DATA_CONTAINER_NAME")
    
    print(f"   Storage Account Name: {training_data_storage_account_name or '<not set>'}")
    print(f"   Container Name: {training_data_container_name or '<not set>'}")

    if training_data_storage_account_name and training_data_container_name:
        print("\n🔑 Generating SAS URL...")
        # We require "Write" permission to upload, modify, or append blobs
        training_data_sas_url = processor.generate_container_sas_url(
            account_name=training_data_storage_account_name,
            container_name=training_data_container_name,
            permissions=ContainerSasPermissions(read=True, write=True, list=True),
            expiry_hours=1,
        )
        print("✅ SAS URL generated successfully")
    else:
        print("\n⚠️  Warning: Storage account name or container name not set. Cannot generate SAS URL.")

if training_data_sas_url:
    print(f"\n📤 Uploading training data from '{training_docs_folder}'...")

    # generate_training_data_on_blob is async. Jupyter already runs an event loop
    # and supports top-level `await`, so the call can be awaited directly.
    # In a plain Python script, wrap the call in asyncio.run(...) instead.
    await processor.generate_training_data_on_blob(
        training_docs_folder, training_data_sas_url, training_data_path)

    print("✅ Training data upload completed!")
else:
    print("\n❌ Error: No SAS URL available. Please set TRAINING_DATA_SAS_URL or provide storage account credentials.")

## Create Analyzer with Defined Schema
Before creating the analyzer, set `analyzer_id` to a name relevant to your task. In this example, we append a Unix timestamp so that this cell can be run multiple times to create different analyzers.

We use **TRAINING_DATA_SAS_URL** and **TRAINING_DATA_PATH** as set in the [.env](./.env) file and used in the previous step.

In [None]:
analyzer_id = f"notebooks_sample_analyzer_training_{int(time.time())}"

# Build knowledge sources if we have training data
knowledge_sources = None
if training_data_sas_url and training_data_path:
    print("📚 Configuring knowledge sources with labeled training data...")
    print(f"   Container SAS URL: <provided>")
    print(f"   Storage Prefix: {training_data_path}")
    
    # Build knowledge source configuration
    knowledge_source_config = {
        "kind": "labeledData",
        "containerUrl": training_data_sas_url,
        "prefix": training_data_path
    }
    
    # Optionally add file list path if specified
    file_list_path = os.getenv("CONTENT_UNDERSTANDING_FILE_LIST_PATH", "")
    if file_list_path:
        knowledge_source_config["fileListPath"] = file_list_path
        print(f"   File List Path: {file_list_path}")
    
    knowledge_sources = [knowledge_source_config]
    print("✅ Knowledge source configured")
else:
    print("⚠️  No training data available - creating analyzer without knowledge sources")

# Define the analyzer as a dictionary
content_analyzer = {
    "baseAnalyzerId": "prebuilt-document",
    "description": "Extract useful information from receipt with labeled training data",
    "config": {
        "returnDetails": True,
        "enableLayout": True,
        "enableFormula": False,
        "estimateFieldSourceAndConfidence": True
    },
    "fieldSchema": {
        "name": "receipt schema",
        "description": "Schema for receipt",
        "fields": {
            "MerchantName": {
                "type": "string",
                "method": "extract",
                "description": "Name of the merchant"
            },
            "Items": {
                "type": "array",
                "method": "generate",
                "description": "List of items purchased",
                "items": {
                    "type": "object",
                    "method": "extract",
                    "description": "Individual item details",
                    "properties": {
                        "Quantity": {
                            "type": "string",
                            "method": "extract",
                            "description": "Quantity of the item"
                        },
                        "Name": {
                            "type": "string",
                            "method": "extract",
                            "description": "Name of the item"
                        },
                        "Price": {
                            "type": "string",
                            "method": "extract",
                            "description": "Price of the item"
                        }
                    }
                }
            },
            "TotalPrice": {
                "type": "string",
                "method": "extract",
                "description": "Total price on the receipt"
            }
        }
    },
    "tags": {"demo_type": "analyzer_training"},
    "models": {
        "completion": "gpt-4.1",
        "embedding": "text-embedding-3-large"  # Required when using knowledge sources
    }
}

# Add knowledge sources if available
if knowledge_sources:
    content_analyzer["knowledgeSources"] = knowledge_sources

print(f"\n🔧 Creating custom analyzer '{analyzer_id}'...")
print(f"   With knowledge sources: {'Yes' if knowledge_sources else 'No'}")

response = client.begin_create_analyzer(
    analyzer_id=analyzer_id,
    analyzer_template=content_analyzer,
)

# Wait for the analyzer to be created
print("⏳ Waiting for analyzer creation to complete...")
client.poll_result(response)
print(f"✅ Analyzer '{analyzer_id}' created successfully!")

## Use Created Analyzer to Extract Document Content
After the analyzer is successfully created, you can use it to analyze your input files.
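
The extracted fields come back as typed nodes (`valueString`, `valueNumber`, `valueArray`, `valueObject`), as the display code below shows. A small recursive helper can flatten them into plain Python values for downstream use. This is a minimal sketch covering only the four node types handled in this notebook, and `field_to_python` and `fields_to_python` are hypothetical names.

```python
from typing import Any


def field_to_python(field: dict) -> Any:
    """Convert one typed field node into a plain Python value."""
    field_type = field.get("type")
    if field_type == "string":
        return field.get("valueString")
    if field_type == "number":
        return field.get("valueNumber")
    if field_type == "array":
        return [field_to_python(item) for item in field.get("valueArray", [])]
    if field_type == "object":
        return {
            key: field_to_python(value)
            for key, value in field.get("valueObject", {}).items()
        }
    # Other field types are not covered by this sketch.
    return None


def fields_to_python(fields: dict) -> dict:
    """Flatten the `fields` mapping of one content item."""
    return {name: field_to_python(node) for name, node in fields.items()}
```

Applied to the `fields` dictionary of the first content item, this yields a nested structure of strings, numbers, lists, and dictionaries that is easier to serialize or compare than the raw typed nodes.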

In [None]:
file_path = "../data/receipt.png"
print(f"📄 Reading document file: {file_path}")

# Begin document analysis operation
print(f"🔍 Starting document analysis with analyzer '{analyzer_id}'...")
analysis_response = client.begin_analyze_binary(
    analyzer_id=analyzer_id,
    file_location=file_path,
)

# Wait for analysis completion
print("⏳ Waiting for document analysis to complete...")
analysis_result = client.poll_result(analysis_response)
print("✅ Document analysis completed successfully!")

# Display results
if analysis_result and "result" in analysis_result:
    result = analysis_result["result"]
    contents = result.get("contents", [])
    
    if contents:
        first_content = contents[0]
        
        # Display markdown content
        print("\n📄 Markdown Content:")
        print("=" * 50)
        markdown = first_content.get("markdown", "")
        print(markdown[:500] + "..." if len(markdown) > 500 else markdown)
        print("=" * 50)
        
        # Display extracted fields
        print("\n📊 Analyzer Training Results:")
        fields = first_content.get("fields", {})
        if fields:
            for field_name, field_value in fields.items():
                field_type = field_value.get("type")
                print(f"\n{field_name}:")
                if field_type == "string":
                    print(f"  Value: {field_value.get('valueString')}")
                elif field_type == "number":
                    print(f"  Value: {field_value.get('valueNumber')}")
                elif field_type == "array":
                    print(f"  Array with {len(field_value.get('valueArray', []))} items:")
                    for idx, item in enumerate(field_value.get('valueArray', []), 1):
                        if item.get('type') == 'object':
                            print(f"    Item {idx}:")
                            for key, val in item.get('valueObject', {}).items():
                                if val.get('type') == 'string':
                                    print(f"      {key}: {val.get('valueString')}")
                                elif val.get('type') == 'number':
                                    print(f"      {key}: {val.get('valueNumber')}")
        else:
            print("No fields extracted")
        
        # Display content metadata
        print("\n📋 Content Metadata:")
        print(f"   Category: {first_content.get('category', 'N/A')}")
        print(f"   Start Page Number: {first_content.get('startPageNumber', 'N/A')}")
        print(f"   End Page Number: {first_content.get('endPageNumber', 'N/A')}")
        
        # Check if this is document content to access document-specific properties
        if first_content.get("kind") == "document":
            print("\n📚 Document Information:")
            start_page = first_content.get("startPageNumber", 0)
            end_page = first_content.get("endPageNumber", 0)
            print(f"Start page: {start_page}")
            print(f"End page: {end_page}")
            print(f"Total pages: {end_page - start_page + 1}")

            # Check for pages
            pages = first_content.get("pages")
            if pages:
                print(f"\n📄 Pages ({len(pages)}):")
                for page in pages:
                    unit = first_content.get("unit", "units")
                    print(f"  Page {page.get('pageNumber')}: {page.get('width')} x {page.get('height')} {unit}")

            # Check if there are tables in the document
            tables = first_content.get("tables")
            if tables:
                print(f"\n📊 Tables ({len(tables)}):")
                for idx, table in enumerate(tables, 1):
                    row_count = table.get("rowCount", 0)
                    col_count = table.get("columnCount", 0)
                    print(f"  Table {idx}: {row_count} rows x {col_count} columns")
        else:
            print("\n📚 Document Information: Not available for this content type")
    else:
        print("No contents available in analysis result")
    
    # Save the analysis result to a file
    saved_file_path = save_json_to_file(analysis_result, filename_prefix="analyzer_training_result")
    # Print the full analysis result as a JSON string
    print(json.dumps(analysis_result, indent=2))
else:
    print("No analysis result available")

## Delete Existing Analyzer in Content Understanding Service
This snippet is optional and is included to prevent test analyzers from remaining in your service. Without deletion, the analyzer will stay in your service and may be reused in subsequent operations.

In [None]:
print(f"🗑️  Deleting analyzer '{analyzer_id}'...")
client.delete_analyzer(analyzer_id=analyzer_id)
print(f"✅ Analyzer '{analyzer_id}' deleted successfully!")