# Enhance your analyzer with labeled data


> #################################################################################
>
> Note: Currently, this feature is only available when the analyzer scenario is `document`.
>
> #################################################################################

Labeled data is a set of samples that have been tagged with one or more labels to add context or meaning. It is used to improve the analyzer's performance.

In your own project, you will use the labeling tool in [Azure AI Foundry](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/quickstart/use-ai-foundry) to annotate your data.

This notebook demonstrates how, once you have labeled data, to create an analyzer with it and analyze your files.



## Prerequisites
1. Ensure your Azure AI service is configured by following these [steps](../README.md#configure-azure-ai-service-resource).
1. Follow the steps in [Set labeled data](../docs/set_env_for_labeled_data.md) to add the training-data environment variables to `.env`.
1. Install the packages needed to run the sample.




In [None]:
%pip install -r ../requirements.txt

## Analyzer template and local training folder setup
In this sample we define a template for receipts.

The training folder should contain a flat (one-level) directory of labeled receipt documents. Each document includes:
- The original file (e.g., PDF or image).
- A corresponding `.labels.json` file with labeled fields.
- A corresponding `.result.json` file with OCR results.

In [None]:
analyzer_template = "../analyzer_templates/receipt.json"
training_docs_folder = "../data/document_training"
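The folder layout described above can be checked locally before anything is uploaded. The following is a minimal sketch, not part of the sample's client: the helper name `find_incomplete_documents` is hypothetical, and it assumes the companion files are named by appending `.labels.json` and `.result.json` to the full document file name.

```python
from pathlib import Path

# Assumed naming convention: "<document file name>.labels.json" and
# "<document file name>.result.json" sit next to each document.
LABEL_SUFFIX = ".labels.json"
RESULT_SUFFIX = ".result.json"


def find_incomplete_documents(folder):
    """Return {document name: [missing companion file names]} for a flat folder."""
    files = {p.name for p in Path(folder).iterdir() if p.is_file()}
    # Every file that is not itself a companion file is treated as a document.
    documents = sorted(
        name for name in files
        if not name.endswith((LABEL_SUFFIX, RESULT_SUFFIX))
    )
    incomplete = {}
    for name in documents:
        missing = [
            name + suffix
            for suffix in (LABEL_SUFFIX, RESULT_SUFFIX)
            if name + suffix not in files
        ]
        if missing:
            incomplete[name] = missing
    return incomplete
```

Calling `find_incomplete_documents(training_docs_folder)` returns an empty dictionary when every document has both companion files under the assumed naming scheme.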

## Create Azure Content Understanding client
> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is a utility class that contains the functions used in this notebook. Until the Content Understanding SDK is released, please consider it a lightweight SDK. Fill in values for the constants **AZURE_AI_ENDPOINT**, **AZURE_AI_API_VERSION**, and **AZURE_AI_API_KEY** with the information from your Azure AI Service.

> ⚠️ Important:
You must update the code below to match your Azure authentication method.
Look for the `# IMPORTANT` comments and modify those sections accordingly.
If you skip this step, the sample may not run correctly.

> ⚠️ Note: Using a subscription key works, but using a token provider with Microsoft Entra ID (formerly Azure Active Directory, AAD) is much safer and is highly recommended for production environments.

In [None]:
import logging
import json
import os
import sys
from pathlib import Path
from dotenv import find_dotenv, load_dotenv
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

# import utility package from python samples root directory
parent_dir = Path.cwd().parent
sys.path.append(str(parent_dir))
from python.content_understanding_client import AzureContentUnderstandingClient

load_dotenv(find_dotenv())
logging.basicConfig(level=logging.INFO)

credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")

client = AzureContentUnderstandingClient(
    endpoint=os.getenv("AZURE_AI_ENDPOINT"),
    api_version=os.getenv("AZURE_AI_API_VERSION", "2025-05-01-preview"),
    # IMPORTANT: Comment out token_provider if using subscription key
    token_provider=token_provider,
    # IMPORTANT: Uncomment this if using subscription key
    # subscription_key=os.getenv("AZURE_AI_API_KEY"),
    x_ms_useragent="azure-ai-content-understanding-python/analyzer_training", # This header is used for sample usage telemetry, please comment out this line if you want to opt out.
)
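As an alternative to commenting and uncommenting the `# IMPORTANT` lines by hand, the choice of authentication method could be driven by whether **AZURE_AI_API_KEY** is set. This is only a sketch under that assumption: `build_auth_kwargs` is a hypothetical helper, not part of `AzureContentUnderstandingClient`. It returns the same `subscription_key` or `token_provider` keyword arguments that the cell above passes to the client.

```python
def build_auth_kwargs(env, make_token_provider):
    """Choose client credentials from an environment mapping.

    Returns {"subscription_key": ...} when AZURE_AI_API_KEY is set and
    non-empty, otherwise {"token_provider": ...}.
    """
    key = env.get("AZURE_AI_API_KEY")
    if key:
        return {"subscription_key": key}
    # Build the token provider lazily so it is only created when needed.
    return {"token_provider": make_token_provider()}


# Possible usage with the names from the cell above (shown as comments
# because it needs the Azure packages and a configured environment):
#
# client = AzureContentUnderstandingClient(
#     endpoint=os.getenv("AZURE_AI_ENDPOINT"),
#     api_version=os.getenv("AZURE_AI_API_VERSION", "2025-05-01-preview"),
#     **build_auth_kwargs(os.environ, lambda: token_provider),
# )
```

Falling back to the token provider when no key is present keeps the recommended method as the default.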

## Prepare labeled data
In this step, we will:
- Check whether document files in local folder have corresponding `.labels.json` and `.result.json` files
- Upload these files to the designated Azure blob storage.

We use **TRAINING_DATA_SAS_URL** and **TRAINING_DATA_PATH**, which were set in the Prerequisites step.
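Because `os.getenv` returns `None` for an unset variable, it can help to fail early with a clear message before starting the upload. The helper below is a hypothetical sketch (`require_env` is not part of the sample code) that checks a list of variable names against an environment mapping.

```python
import os


def require_env(names, env=None):
    """Return the values for `names`, raising if any is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
        )
    return [env[name] for name in names]
```

For example, `TRAINING_DATA_SAS_URL, TRAINING_DATA_PATH = require_env(["TRAINING_DATA_SAS_URL", "TRAINING_DATA_PATH"])` would raise a `RuntimeError` naming whichever variable is missing from `.env`.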

In [None]:
TRAINING_DATA_SAS_URL = os.getenv("TRAINING_DATA_SAS_URL")
TRAINING_DATA_PATH = os.getenv("TRAINING_DATA_PATH")

await client.generate_training_data_on_blob(training_docs_folder, TRAINING_DATA_SAS_URL, TRAINING_DATA_PATH)

## Create analyzer with defined schema
Before creating the analyzer, set the constant **CUSTOM_ANALYZER_ID** to a name relevant to your task. Here, we append a unique suffix so this cell can be run multiple times to create different analyzers.

We use **TRAINING_DATA_SAS_URL** and **TRAINING_DATA_PATH**, which are set in the `.env` file and were used in the previous step.

In [None]:
import uuid
CUSTOM_ANALYZER_ID = "train-sample-" + str(uuid.uuid4())

response = client.begin_create_analyzer(
    CUSTOM_ANALYZER_ID,
    analyzer_template_path=analyzer_template,
    training_storage_container_sas_url=TRAINING_DATA_SAS_URL,
    training_storage_container_path_prefix=TRAINING_DATA_PATH,
)
result = client.poll_result(response)
if result is not None and "status" in result and result["status"] == "Succeeded":
    logging.info(f"Analyzer details for {result['result']['analyzerId']}")
    logging.info(json.dumps(result, indent=2))
else:
    logging.warning(
        "An issue was encountered when trying to create the analyzer. "
        "Please double-check your deployment and configurations for potential problems."
    )

## Use created analyzer to extract document content
After the analyzer is successfully created, we can use it to analyze our input files.

In [None]:
response = client.begin_analyze(CUSTOM_ANALYZER_ID, file_location='../data/receipt.png')
result_json = client.poll_result(response)

logging.info(json.dumps(result_json, indent=2))
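The full JSON can be long, so a flattened view of the extracted fields is often easier to read. The sketch below rests on an assumption about the response shape that the notebook itself does not confirm: that fields live under `result["result"]["contents"][0]["fields"]`, and that each field stores its value under a key beginning with `value` (such as `valueString` or `valueNumber`). `extract_fields` is a hypothetical helper, and it should be adjusted if the logged output looks different.

```python
def extract_fields(result_json):
    """Flatten the first content item's fields into {field name: value}.

    Assumes fields are at result -> contents[0] -> fields and that each
    field keeps its value under a key starting with "value".
    """
    contents = (result_json or {}).get("result", {}).get("contents", [])
    if not contents:
        return {}
    flat = {}
    for name, field in contents[0].get("fields", {}).items():
        # Pick the first "value*" key; fields without one map to None.
        value_key = next((k for k in field if k.startswith("value")), None)
        flat[name] = field[value_key] if value_key else None
    return flat
```

With the variables from the cell above, `extract_fields(result_json)` yields a plain dictionary that can be logged or compared against the labels.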

## Delete the existing analyzer in the Content Understanding service
This step is optional. It prevents the test analyzer from lingering in your service. If you skip the deletion, the analyzer remains in your service and can be reused later.

In [None]:
client.delete_analyzer(CUSTOM_ANALYZER_ID)