# Move Training Data Across Analyzers

This notebook demonstrates how to reuse training data from an existing analyzer when creating a new analyzer in the same Azure AI Content Understanding resource.

## Overview

When you have an analyzer with training data and want to create a new analyzer using the same labeled examples, you can reference the existing blob storage location without duplicating or moving the data.

### Benefits
- **No data duplication**: Reuse existing training data without copying
- **Same resource**: Both analyzers access the same blob storage
- **Field portability**: Maintain stable `fieldId`s across analyzers
- **Rapid iteration**: Test schema variations quickly
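
The reuse pattern comes down to pointing a second analyzer definition at a blob location that already exists. The sketch below illustrates that idea, assuming the `trainingData` shape (`kind`, `containerUrl`, `prefix`) that the service returns later in this notebook. The helper name `training_data_reference` and the example URL are hypothetical.

```python
def training_data_reference(container_url: str, prefix: str = "") -> dict:
    """Build a trainingData block that points at an existing blob location.

    Hypothetical helper: the key names mirror the analyzer definition printed
    later in this notebook. Nothing is copied or uploaded.
    """
    if not container_url.startswith("https://"):
        raise ValueError("container_url must be an https:// blob container URL")
    return {"kind": "blob", "containerUrl": container_url, "prefix": prefix}


# Two analyzer definitions can share one reference (placeholder URL).
shared = training_data_reference(
    "https://example.blob.core.windows.net/training", "labelingProjects/demo/train"
)
analyzer_a = {"description": "baseline schema", "trainingData": shared}
analyzer_b = {"description": "schema variation", "trainingData": shared}
print(analyzer_a["trainingData"] == analyzer_b["trainingData"])
```

Because both definitions hold the same reference, changing the schema of one analyzer leaves the labeled files untouched.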

### Prerequisites
1. An existing analyzer with training data already configured
2. Azure AI service configured by following the [configuration steps](../README.md#configure-azure-ai-service-resource)
3. Required packages installed

In [1]:
%pip install -r ../requirements.txt

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


## Create Azure AI Content Understanding Client

> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is a utility class that wraps the Content Understanding API. Until the official Content Understanding SDK is released, it serves as a lightweight SDK.

> ⚠️ **Important**: Update the code below to match your Azure authentication method. Look for the `# IMPORTANT` comments and modify those sections accordingly.

> ⚠️ **Note**: Using a subscription key works, but using a token provider with Azure Active Directory (AAD) is safer and highly recommended for production environments.
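
One way to avoid commenting and uncommenting lines is to pick the authentication argument in code. This is a minimal sketch, assuming the client accepts either `token_provider` or `subscription_key` as shown in the cell below. The helper `auth_kwargs` is hypothetical and is not part of the utility class.

```python
import os
from typing import Any, Callable, Dict, Optional


def auth_kwargs(
    api_key: Optional[str], token_provider: Optional[Callable[[], str]]
) -> Dict[str, Any]:
    """Return exactly one authentication argument, preferring the token provider."""
    if token_provider is not None:
        return {"token_provider": token_provider}
    if api_key:
        return {"subscription_key": api_key}
    raise ValueError("Provide either a token provider or AZURE_AI_API_KEY")


# Demo with a placeholder key; in the notebook the key would come from ".env".
print(sorted(auth_kwargs(os.getenv("AZURE_AI_API_KEY") or "demo-key", None)))
```

The returned dictionary could then be unpacked into the client constructor, for example `AzureContentUnderstandingClient(endpoint=..., api_version=..., **auth_kwargs(key, provider))`.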

In [4]:
import logging
import json
import os
import sys
import uuid
from pathlib import Path
from dotenv import find_dotenv, load_dotenv
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

load_dotenv(find_dotenv())
logging.basicConfig(level=logging.INFO)

# For authentication, you can use either token-based authentication or a subscription key; only one method is required.
AZURE_AI_ENDPOINT = os.getenv("AZURE_AI_ENDPOINT")
# IMPORTANT: Replace with your actual subscription key or set it in the ".env" file if not using token authentication.
AZURE_AI_API_KEY = os.getenv("AZURE_AI_API_KEY")
AZURE_AI_API_VERSION = os.getenv("AZURE_AI_API_VERSION", "2025-05-01-preview")

# Add the parent directory to the path to use shared modules
parent_dir = Path(Path.cwd()).parent
sys.path.append(str(parent_dir))
from python.content_understanding_client import AzureContentUnderstandingClient

credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")

client = AzureContentUnderstandingClient(
    endpoint=AZURE_AI_ENDPOINT,
    api_version=AZURE_AI_API_VERSION,
    # IMPORTANT: Comment out token_provider if using subscription key
    token_provider=token_provider,
    # IMPORTANT: Uncomment this if using subscription key
    # subscription_key=AZURE_AI_API_KEY,
    x_ms_useragent="azure-ai-content-understanding-python/move_training_data",
)

print("✅ Content Understanding client initialized successfully!")

INFO:azure.identity._credentials.environment:No environment configuration found.
INFO:azure.identity._credentials.managed_identity:ManagedIdentityCredential will use IMDS
INFO:azure.core.pipeline.policies.http_logging_policy:Request URL: 'http://169.254.169.254/metadata/identity/oauth2/token?api-version=REDACTED&resource=REDACTED'
Request method: 'GET'
Request headers:
    'User-Agent': 'azsdk-python-identity/1.25.0 Python/3.11.13 (Linux-6.8.0-1030-azure-x86_64-with-glibc2.41)'
No body was attached to the request

✅ Content Understanding client initialized successfully!


## Step 1: List Available Analyzers

First, let's see what analyzers are available in your resource. We'll look for analyzers that have training data configured.

In [7]:
# Get all analyzers in your resource
all_analyzers = client.get_all_analyzers()
analyzers_list = all_analyzers.get('value', [])

print(f"Found {len(analyzers_list)} analyzer(s) in your resource\n")

# Display analyzer names and IDs
if analyzers_list:
    print("Available analyzers:")
    for idx, analyzer in enumerate(analyzers_list, 1):
        analyzer_id = analyzer.get('analyzerId', 'N/A')
        analyzer_name = analyzer.get('name', 'N/A')
        print(f"{idx}. ID: {analyzer_id}")
        print(f"   Name: {analyzer_name}")
        print()
else:
    print("No analyzers found. Please create an analyzer with training data first.")
    print("See: notebooks/analyzer_training.ipynb for guidance.")

Found 675 analyzer(s) in your resource

Available analyzers:
1. ID: prebuilt-audioAnalyzer
   Name: N/A

2. ID: prebuilt-callCenter
   Name: N/A

3. ID: prebuilt-contract
   Name: N/A

4. ID: prebuilt-documentAnalyzer
   Name: N/A

5. ID: prebuilt-imageAnalyzer
   Name: N/A

6. ID: prebuilt-invoice
   Name: N/A

7. ID: prebuilt-videoAnalyzer
   Name: N/A

8. ID: 123
   Name: N/A

9. ID: Test-description
   Name: N/A

10. ID: Test
   Name: N/A

11. ID: abc
   Name: N/A

12. ID: audio-250808
   Name: N/A

13. ID: auto-highlight-analyzer-1753389013
   Name: N/A

14. ID: auto-highlight-analyzer-1753393121
   Name: N/A

15. ID: auto-highlight-analyzer-1753727044
   Name: N/A

16. ID: auto-highlight-analyzer-1753728638
   Name: N/A

17. ID: auto-highlight-analyzer-1753822646
   Name: N/A

18. ID: auto-highlight-analyzer-1753823934
   Name: N/A

19. ID: auto-highlight-analyzer-1753826664
   Name: N/A

20. ID: auto-highlight-analyzer-1753829625
   Name: N/A

21. ID: auto-highlight-analyzer-175
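
With hundreds of analyzers in a resource, it helps to narrow the list before picking a source. The sketch below filters out the built-in analyzers, assuming (as in the output above) that their IDs start with `prebuilt-`. The helper `custom_analyzer_ids` is hypothetical, and each remaining candidate still needs the detail lookup in Step 3 to confirm that it has training data.

```python
from typing import Iterable, List


def custom_analyzer_ids(analyzers: Iterable[dict], prefix: str = "prebuilt-") -> List[str]:
    """Return the sorted IDs of analyzers that are not built in."""
    ids = [a.get("analyzerId", "") for a in analyzers]
    return sorted(i for i in ids if i and not i.startswith(prefix))


# Sample entries shaped like the list response used above.
sample = [
    {"analyzerId": "prebuilt-invoice"},
    {"analyzerId": "audio-250808"},
    {"analyzerId": "abc"},
    {},
]
print(custom_analyzer_ids(sample))
```

In the notebook this would be called as `custom_analyzer_ids(analyzers_list)`.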

## Step 2: Select Source Analyzer

Specify the ID of the analyzer whose training data you want to reuse.

**Option 1**: Set `SOURCE_ANALYZER_ID` to an existing analyzer ID from the list above.

**Option 2**: If you don't have an analyzer with training data, uncomment and run the next cell to create one first.

In [8]:
# OPTION 1: Specify an existing analyzer ID that has training data
# Replace this with your actual analyzer ID
SOURCE_ANALYZER_ID = "invoiceLabeledData"

# Uncomment to use the first analyzer from the list
# if analyzers_list:
#     SOURCE_ANALYZER_ID = analyzers_list[0].get('analyzerId')
#     print(f"Using first analyzer: {SOURCE_ANALYZER_ID}")

print(f"Source Analyzer ID: {SOURCE_ANALYZER_ID}")

Source Analyzer ID: invoiceLabeledData


### Option 2: Create a Source Analyzer with Training Data (Optional)

If you don't have an existing analyzer with training data, run this cell to create one first.

**Prerequisites**:
- Set environment variables for training data (see [docs/set_env_for_training_data_and_reference_doc.md](../docs/set_env_for_training_data_and_reference_doc.md))
- Ensure you have labeled training data in `../data/document_training/`
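
The environment variables can be checked up front, before any upload starts. This is a minimal sketch of the decision the optional cell makes, using the variable names from that cell. The helper `training_data_source` and its return shape are hypothetical.

```python
from typing import Mapping, Tuple


def training_data_source(env: Mapping[str, str]) -> Tuple[str, dict]:
    """Decide whether to use a ready-made SAS URL or to generate one."""
    sas_url = env.get("TRAINING_DATA_SAS_URL")
    if sas_url:
        return "sas_url", {"sas_url": sas_url}
    account = env.get("TRAINING_DATA_STORAGE_ACCOUNT_NAME")
    container = env.get("TRAINING_DATA_CONTAINER_NAME")
    if account and container:
        return "generate_sas", {"account_name": account, "container_name": container}
    raise ValueError(
        "Set TRAINING_DATA_SAS_URL, or both TRAINING_DATA_STORAGE_ACCOUNT_NAME "
        "and TRAINING_DATA_CONTAINER_NAME."
    )


# Placeholder values; in the notebook you would pass os.environ instead.
mode, details = training_data_source(
    {
        "TRAINING_DATA_STORAGE_ACCOUNT_NAME": "examplestorage",
        "TRAINING_DATA_CONTAINER_NAME": "training",
    }
)
print(mode, details)
```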

In [None]:
# Uncomment this entire cell if you need to create a source analyzer first

# from azure.storage.blob import ContainerSasPermissions

# # Configure training data
# analyzer_template_path = "../analyzer_templates/receipt.json"
# training_docs_folder = "../data/document_training"

# # Get or generate SAS URL
# training_data_sas_url = os.getenv("TRAINING_DATA_SAS_URL")
# if not training_data_sas_url:
#     TRAINING_DATA_STORAGE_ACCOUNT_NAME = os.getenv("TRAINING_DATA_STORAGE_ACCOUNT_NAME")
#     TRAINING_DATA_CONTAINER_NAME = os.getenv("TRAINING_DATA_CONTAINER_NAME")
#     if not TRAINING_DATA_STORAGE_ACCOUNT_NAME or not TRAINING_DATA_CONTAINER_NAME:
#         raise ValueError(
#             "Please set either TRAINING_DATA_SAS_URL or both TRAINING_DATA_STORAGE_ACCOUNT_NAME "
#             "and TRAINING_DATA_CONTAINER_NAME environment variables."
#         )
#     training_data_sas_url = AzureContentUnderstandingClient.generate_temp_container_sas_url(
#         account_name=TRAINING_DATA_STORAGE_ACCOUNT_NAME,
#         container_name=TRAINING_DATA_CONTAINER_NAME,
#         permissions=ContainerSasPermissions(read=True, write=True, list=True),
#         expiry_hours=1,
#     )

# training_data_path = os.getenv("TRAINING_DATA_PATH")

# # Upload training data to blob storage
# print("Uploading training data to blob storage...")
# await client.generate_training_data_on_blob(training_docs_folder, training_data_sas_url, training_data_path)
# print("✅ Training data uploaded successfully!")

# # Create source analyzer
# SOURCE_ANALYZER_ID = "source-analyzer-" + str(uuid.uuid4())
# print(f"Creating source analyzer: {SOURCE_ANALYZER_ID}")

# response = client.begin_create_analyzer(
#     SOURCE_ANALYZER_ID,
#     analyzer_template_path=analyzer_template_path,
#     training_storage_container_sas_url=training_data_sas_url,
#     training_storage_container_path_prefix=training_data_path,
# )
# result = client.poll_result(response)
# print("✅ Source analyzer created successfully!")
# print(json.dumps(result, indent=2))

## Step 3: Retrieve Source Analyzer Details

Now we'll fetch the complete definition of the source analyzer, including its training data configuration.

In [20]:
# Get detailed information about the source analyzer
source_analyzer = client.get_analyzer_detail_by_id(SOURCE_ANALYZER_ID)

print(f"Source Analyzer: {SOURCE_ANALYZER_ID}")
print(f"Name: {source_analyzer.get('name', 'N/A')}")
print(f"Description: {source_analyzer.get('description', 'N/A')}")
print("\nFull analyzer definition:")
print(json.dumps(source_analyzer, indent=2))

Source Analyzer: invoiceLabeledData
Name: N/A
Description: 

Full analyzer definition:
{
  "analyzerId": "invoiceLabeledData",
  "description": "",
  "tags": {
    "projectId": "d7afeaa4-fe05-4df7-bd7c-46f3a94a96cb",
    "templateId": "document-2025-05-01"
  },
  "createdAt": "2025-10-22T22:03:08Z",
  "lastModifiedAt": "2025-10-22T22:03:11Z",
  "baseAnalyzerId": "prebuilt-documentAnalyzer",
  "config": {
    "returnDetails": true,
    "enableOcr": true,
    "enableLayout": true,
    "enableFormula": false,
    "disableContentFiltering": false,
    "tableFormat": "html",
    "estimateFieldSourceAndConfidence": false
  },
  "fieldSchema": {
    "fields": {
      "CompanyName": {
        "type": "string",
        "method": "extract",
        "description": "Name of the pharmaceutical company involved in the rebate program"
      },
      "ProductDetails": {
        "type": "array",
        "description": "List of products with rebate and unit details",
        "items": {
          "type":

## Step 4: Extract Training Data Configuration

Extract the training data configuration from the source analyzer. This includes:
- **trainingData**: The blob container location with labeled examples
- **fieldSchema**: The field definitions
- **tags**: Project and template metadata (important for Azure AI Foundry project association)
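
The container URL in this run carries no query string, but a container URL can include a SAS token, and printing it would expose the token in the notebook output. The sketch below shows one way to drop the query string before logging, using only the standard library. The helper name `without_query` and the example URL are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def without_query(url: str) -> str:
    """Drop the query string and fragment (where a SAS token would live) from a URL."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# Placeholder URL with a fake token.
print(without_query("https://example.blob.core.windows.net/training?sv=2024-01-01&sig=REDACTED"))
```

In the cell below, `print(f"... {without_query(container_url)}")` would keep any token out of the saved output.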

In [21]:
# Extract training data configuration
training_data_config = source_analyzer.get('trainingData')
knowledge_sources_config = source_analyzer.get('knowledgeSources')
field_schema = source_analyzer.get('fieldSchema', {})
tags = source_analyzer.get('tags', {})

print("📦 Training Data Configuration:")
if training_data_config:
    print(json.dumps(training_data_config, indent=2))
    container_url = training_data_config.get('containerUrl', 'N/A')
    prefix = training_data_config.get('prefix', '')
    print(f"\n✅ Found training data at: {container_url}")
    print(f"   Path prefix: {prefix}")
else:
    print("⚠️  No training data found in this analyzer.")
    print("   Please select an analyzer that has training data configured.")

print("\n📚 Knowledge Sources Configuration:")
if knowledge_sources_config:
    print(json.dumps(knowledge_sources_config, indent=2))
else:
    print("No knowledge sources configured (this is normal for standard mode)")

print("\n📋 Field Schema:")
print(json.dumps(field_schema, indent=2))

print("\n🏷️  Tags (Project & Template Metadata):")
if tags:
    print(json.dumps(tags, indent=2))
    project_id = tags.get('projectId')
    template_id = tags.get('templateId')
    if project_id:
        print(f"\n✅ Found Project ID: {project_id}")
    if template_id:
        print(f"✅ Found Template ID: {template_id}")
    print("\n💡 These tags will be copied to ensure the new analyzer appears in the same Azure AI Foundry project.")
else:
    print("No tags found (the new analyzer may not be associated with a Foundry project)")

📦 Training Data Configuration:
{
  "containerUrl": "https://staistudiote203841201294.blob.core.windows.net/7c123b64-9378-4fa7-a807-081efa839c00-cu",
  "kind": "blob",
  "prefix": "labelingProjects/d7afeaa4-fe05-4df7-bd7c-46f3a94a96cb/train"
}

✅ Found training data at: https://staistudiote203841201294.blob.core.windows.net/7c123b64-9378-4fa7-a807-081efa839c00-cu
   Path prefix: labelingProjects/d7afeaa4-fe05-4df7-bd7c-46f3a94a96cb/train

📚 Knowledge Sources Configuration:
No knowledge sources configured (this is normal for standard mode)

📋 Field Schema:
{
  "fields": {
    "CompanyName": {
      "type": "string",
      "method": "extract",
      "description": "Name of the pharmaceutical company involved in the rebate program"
    },
    "ProductDetails": {
      "type": "array",
      "description": "List of products with rebate and unit details",
      "items": {
        "type": "object",
        "description": "Details of a single product",
        "properties": {
      

## Step 5: Create New Analyzer with Existing Training Data

Now we'll create a new analyzer that references the same training data. This new analyzer will:
- Use the same blob storage container and path
- Start with the same field schema (you can modify this)
- Have its own unique ID
- **Include the same tags** (projectId and templateId) to ensure it appears in the correct Azure AI Foundry project

### Key Points:
- **Same resource**: Both analyzers are in the same Azure AI resource
- **No data duplication**: The training data stays in one place
- **Same project**: Tags ensure the analyzer appears in the same Foundry project
- **Independent lifecycle**: Each analyzer can be updated or deleted independently
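
The payload for the new analyzer is the source definition minus the fields that the service sets itself. The sketch below illustrates that filtering, assuming the writable keys are the ones this notebook copies (`tags`, `baseAnalyzerId`, `config`, `fieldSchema`, `mode`). The helper `clone_payload` and the `WRITABLE_KEYS` tuple are hypothetical.

```python
import copy
from typing import Iterable

# Assumed set of caller-settable keys, taken from the fields this notebook copies.
WRITABLE_KEYS = ("tags", "baseAnalyzerId", "config", "fieldSchema", "mode")


def clone_payload(source: dict, description: str, keys: Iterable[str] = WRITABLE_KEYS) -> dict:
    """Copy only the selected keys so read-only fields are left behind."""
    payload = {"description": description}
    for key in keys:
        if key in source:
            # Deep copy so later schema edits do not change the source definition.
            payload[key] = copy.deepcopy(source[key])
    return payload


source = {
    "analyzerId": "invoiceLabeledData",
    "createdAt": "2025-10-22T22:03:08Z",
    "baseAnalyzerId": "prebuilt-documentAnalyzer",
    "fieldSchema": {"fields": {"CompanyName": {"type": "string", "method": "extract"}}},
}
payload = clone_payload(source, "Created from invoiceLabeledData")
payload["fieldSchema"]["fields"]["CompanyName"]["description"] = "edited in the clone only"
print(sorted(payload))
print("description" in source["fieldSchema"]["fields"]["CompanyName"])
```

The deep copy is a design choice: it lets you experiment with the cloned schema while the dictionary fetched for the source analyzer stays as retrieved.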

In [39]:
# Verify we have training data before proceeding
if not training_data_config:
    raise ValueError(
        "Cannot proceed: Source analyzer does not have training data. "
        "Please select an analyzer with training data or create one using the optional cell above."
    )

# Create a new analyzer ID
# Analyzer names must be 1-64 characters and only contain letters, numbers, dots, underscores, or hyphens
NEW_ANALYZER_ID = "cloned-analyzer-" + str(uuid.uuid4())

# Build the new analyzer payload in the correct order matching the API structure
# Note: Read-only fields like createdAt, lastModifiedAt, status, etc. are omitted as they're set by the service
new_analyzer_payload = {}

# 1. Analyzer ID (not needed as it's passed separately, but kept for reference)
# new_analyzer_payload["analyzerId"] = NEW_ANALYZER_ID

# 2. Description
new_analyzer_payload["description"] = f"Created from {SOURCE_ANALYZER_ID} with reused training data"

# 3. Tags (projectId and templateId) - IMPORTANT for Foundry project association
if tags:
    new_analyzer_payload["tags"] = tags
    print("✅ Including tags from source analyzer (ensures correct project association in Foundry)")
    print(f"   Project ID: {tags.get('projectId', 'N/A')}")
    print(f"   Template ID: {tags.get('templateId', 'N/A')}")
else:
    print("⚠️  No tags found in source analyzer - new analyzer may not appear in Foundry project")

# 4. Base Analyzer ID (if present)
if 'baseAnalyzerId' in source_analyzer:
    new_analyzer_payload['baseAnalyzerId'] = source_analyzer['baseAnalyzerId']

# 5. Config settings
if 'config' in source_analyzer:
    new_analyzer_payload['config'] = source_analyzer['config']

# 6. Field Schema
new_analyzer_payload["fieldSchema"] = field_schema

# 7. Training Data - Will be passed separately to begin_create_analyzer()
# Note: We extract the container URL and prefix to pass as separate parameters
training_container_sas_url = training_data_config.get('containerUrl', '')
training_container_prefix = training_data_config.get('prefix', '')

# 8. Knowledge Sources (if present - typically for Pro mode)
# Extract these separately if they exist
pro_mode_container_sas_url = ""
pro_mode_container_prefix = ""
if knowledge_sources_config and isinstance(knowledge_sources_config, list) and len(knowledge_sources_config) > 0:
    # Get the first knowledge source (typically there's only one)
    first_knowledge_source = knowledge_sources_config[0]
    pro_mode_container_sas_url = first_knowledge_source.get('containerUrl', '')
    pro_mode_container_prefix = first_knowledge_source.get('prefix', '')

# 9. Mode (if present)
if 'mode' in source_analyzer:
    new_analyzer_payload['mode'] = source_analyzer['mode']

print(f"\nCreating new analyzer: {NEW_ANALYZER_ID}")
print("\nNew analyzer payload (ordered to match API structure):")
print(json.dumps(new_analyzer_payload, indent=2))

print("\n📦 Training data will be configured separately:")
print(f"   Container URL: {training_container_sas_url}")
print(f"   Prefix: {training_container_prefix}")

if pro_mode_container_sas_url:
    print("\n📚 Pro mode reference docs will be configured separately:")
    print(f"   Container URL: {pro_mode_container_sas_url}")
    print(f"   Prefix: {pro_mode_container_prefix}")

✅ Including tags from source analyzer (ensures correct project association in Foundry)
   Project ID: d7afeaa4-fe05-4df7-bd7c-46f3a94a96cb
   Template ID: document-2025-05-01

Creating new analyzer: cloned-analyzer-c073f24d-5659-42ed-8ac8-b083bde79a9b

New analyzer payload (ordered to match API structure):
{
  "description": "Created from invoiceLabeledData with reused training data",
  "tags": {
    "projectId": "d7afeaa4-fe05-4df7-bd7c-46f3a94a96cb",
    "templateId": "document-2025-05-01"
  },
  "baseAnalyzerId": "prebuilt-documentAnalyzer",
  "config": {
    "returnDetails": true,
    "enableOcr": true,
    "enableLayout": true,
    "enableFormula": false,
    "disableContentFiltering": false,
    "tableFormat": "html",
    "estimateFieldSourceAndConfidence": false
  },
  "fieldSchema": {
    "fields": {
      "CompanyName": {
        "type": "string",
        "method": "extract",
        "description": "Name of the pharmaceutical company involved in the rebate program"
      },


In [40]:
# Create the new analyzer
# Pass the training data location as separate parameters.
# The knowledge source values extracted above are not passed in this call.
response = client.begin_create_analyzer(
    NEW_ANALYZER_ID,
    analyzer_template=new_analyzer_payload,
    training_storage_container_sas_url=training_container_sas_url,
    training_storage_container_path_prefix=training_container_prefix,
)

result = client.poll_result(response)

if result and result.get('status') == 'Succeeded':
    print(f"✅ Successfully created new analyzer: {NEW_ANALYZER_ID}")
    print("\nCreation result:")
    print(json.dumps(result, indent=2))
else:
    print("⚠️ Analyzer creation encountered an issue.")
    print(json.dumps(result, indent=2))

INFO:python.content_understanding_client:Analyzer cloned-analyzer-c073f24d-5659-42ed-8ac8-b083bde79a9b create request accepted.
INFO:python.content_understanding_client:Request a22ddf12-3156-4a9a-9675-7b85789a8686 in progress ...
INFO:python.content_understanding_client:Request a22ddf12-3156-4a9a-9675-7b85789a8686 in progress ...
INFO:python.content_understanding_client:Request a22ddf12-3156-4a9a-9675-7b85789a8686 in progress ...
INFO:python.content_understanding_client:Request a22ddf12-3156-4a9a-9675-7b85789a8686 in progress ...
INFO:python.content_understanding_client:Request a22ddf12-3156-4a9a-9675-7b85789a8686 in progress ...
INFO:python.content_understanding_client:Request a22ddf12-3156-4a9a-9675-7b85789a8686 in progress ...
INFO:python.content_understanding_client:Request a22ddf12-3156-4a9a-9675-7b85789a8686 in progress ...
INFO:python.content_understanding_client:Request a22ddf12-3156-4a9a-9675-7b85789a8686 in progress ...
INFO:python.content_understanding_client:Request a22ddf1

✅ Successfully created new analyzer: cloned-analyzer-c073f24d-5659-42ed-8ac8-b083bde79a9b

Creation result:
{
  "id": "a22ddf12-3156-4a9a-9675-7b85789a8686",
  "status": "Succeeded",
  "result": {
    "analyzerId": "cloned-analyzer-c073f24d-5659-42ed-8ac8-b083bde79a9b",
    "description": "Created from invoiceLabeledData with reused training data",
    "tags": {
      "projectId": "d7afeaa4-fe05-4df7-bd7c-46f3a94a96cb",
      "templateId": "document-2025-05-01"
    },
    "createdAt": "2025-10-22T22:44:56Z",
    "lastModifiedAt": "2025-10-22T22:47:27Z",
    "baseAnalyzerId": "prebuilt-documentAnalyzer",
    "config": {
      "returnDetails": true,
      "enableOcr": true,
      "enableLayout": true,
      "enableFormula": false,
      "disableContentFiltering": false,
      "tableFormat": "html",
      "estimateFieldSourceAndConfidence": false
    },
    "fieldSchema": {
      "fields": {
        "CompanyName": {
          "type": "string",
          "method": "extract",
          "d

## Step 6: Verify the New Analyzer

Let's confirm the new analyzer was created correctly and is using the same training data.

In [41]:
# Get details of the newly created analyzer
new_analyzer = client.get_analyzer_detail_by_id(NEW_ANALYZER_ID)

print(f"New Analyzer: {NEW_ANALYZER_ID}")
print(f"Name: {new_analyzer.get('name', 'N/A')}")
print(f"Description: {new_analyzer.get('description', 'N/A')}")
print("\nTraining Data Configuration:")
print(json.dumps(new_analyzer.get('trainingData', {}), indent=2))

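# Verify the training data location matches (container URL and path prefix)
new_training_data = new_analyzer.get('trainingData', {})
original_container = training_data_config.get('containerUrl', '')
new_container = new_training_data.get('containerUrl', '')
# The stored prefix may gain a trailing slash, so normalize before comparing
original_prefix = training_data_config.get('prefix', '').rstrip('/')
new_prefix = new_training_data.get('prefix', '').rstrip('/')

if original_container == new_container and original_prefix == new_prefix:
    print("\n✅ Verification successful: Both analyzers reference the same training data location!")
else:
    print("\n⚠️ Warning: Training data locations don't match.")
    print(f"Original: {original_container} (prefix: {original_prefix})")
    print(f"New: {new_container} (prefix: {new_prefix})")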

New Analyzer: cloned-analyzer-c073f24d-5659-42ed-8ac8-b083bde79a9b
Name: N/A
Description: Created from invoiceLabeledData with reused training data

Training Data Configuration:
{
  "containerUrl": "https://staistudiote203841201294.blob.core.windows.net/7c123b64-9378-4fa7-a807-081efa839c00-cu",
  "kind": "blob",
  "prefix": "labelingProjects/d7afeaa4-fe05-4df7-bd7c-46f3a94a96cb/train/"
}

✅ Verification successful: Both analyzers reference the same training data location!
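
Beyond the training data location, it can be useful to confirm that the cloned schema still has the same fields as the source. This is a minimal sketch, assuming both schemas use the `fields` mapping shown in the analyzer definitions above. The helper `field_name_diff` is hypothetical.

```python
def field_name_diff(source_schema: dict, new_schema: dict) -> dict:
    """Compare the field names of two fieldSchema dictionaries."""
    source_fields = set(source_schema.get("fields", {}))
    new_fields = set(new_schema.get("fields", {}))
    return {
        "only_in_source": sorted(source_fields - new_fields),
        "only_in_new": sorted(new_fields - source_fields),
        "shared": sorted(source_fields & new_fields),
    }


# Sample schemas; in the notebook these would come from the two analyzer definitions.
source_schema = {"fields": {"CompanyName": {}, "ProductDetails": {}, "TotalPaid": {}}}
new_schema = {"fields": {"CompanyName": {}, "TotalPaid": {}, "InvoiceDate": {}}}
print(field_name_diff(source_schema, new_schema))
```

In the notebook, the call would be `field_name_diff(source_analyzer.get('fieldSchema', {}), new_analyzer.get('fieldSchema', {}))`.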


## Step 7: Test Both Analyzers

Now let's test both analyzers with a sample file to verify they both work correctly with the shared training data.

In [42]:
# Specify a test file - adjust this path based on your analyzer type
# For receipt analyzers:
test_file = "../data/receipt.png"

# For invoice analyzers:
# test_file = "../data/invoice.pdf"

# For custom documents:
# test_file = "../data/your-document.pdf"

# Verify the file exists
if not Path(test_file).exists():
    print(f"⚠️ Test file not found: {test_file}")
    print("Please adjust the test_file path to match your use case.")
else:
    print(f"Testing with file: {test_file}")

Testing with file: ../data/receipt.png


In [43]:
# Test the original analyzer
if Path(test_file).exists():
    print(f"\n📝 Analyzing with SOURCE analyzer: {SOURCE_ANALYZER_ID}")
    response_source = client.begin_analyze(SOURCE_ANALYZER_ID, file_location=test_file)
    result_source = client.poll_result(response_source)
    
    print("\nSource Analyzer Results:")
    # Print a summary of extracted fields
    if result_source.get('status') == 'Succeeded':
        result_data = result_source.get('result', {})
        # Fall back to an empty dict when 'contents' is missing or empty
        fields = (result_data.get('contents') or [{}])[0].get('fields', {})
        print(f"Extracted {len(fields)} field(s)")
        for field_name, field_value in fields.items():
            print(f"  - {field_name}: {field_value}")
    else:
        print(json.dumps(result_source, indent=2))


📝 Analyzing with SOURCE analyzer: invoiceLabeledData


INFO:python.content_understanding_client:Analyzing file ../data/receipt.png with analyzer: invoiceLabeledData
INFO:python.content_understanding_client:Request 80b00372-a498-4564-9ff1-1e6901778a2d in progress ...
INFO:python.content_understanding_client:Request 80b00372-a498-4564-9ff1-1e6901778a2d in progress ...
INFO:python.content_understanding_client:Request 80b00372-a498-4564-9ff1-1e6901778a2d in progress ...
INFO:python.content_understanding_client:Request 80b00372-a498-4564-9ff1-1e6901778a2d in progress ...
INFO:python.content_understanding_client:Request result is ready after 4.71 seconds.



Source Analyzer Results:
Extracted 3 field(s)
  - CompanyName: {'type': 'string', 'valueString': 'Contoso'}
  - ProductDetails: {'type': 'array'}
  - TotalPaid: {'type': 'number', 'valueNumber': 2516.28}


In [44]:
# Test the new analyzer
if Path(test_file).exists():
    print(f"\n📝 Analyzing with NEW analyzer: {NEW_ANALYZER_ID}")
    response_new = client.begin_analyze(NEW_ANALYZER_ID, file_location=test_file)
    result_new = client.poll_result(response_new)
    
    print("\nNew Analyzer Results:")
    # Print a summary of extracted fields
    if result_new.get('status') == 'Succeeded':
        result_data = result_new.get('result', {})
        # Fall back to an empty dict when 'contents' is missing or empty
        fields = (result_data.get('contents') or [{}])[0].get('fields', {})
        print(f"Extracted {len(fields)} field(s)")
        for field_name, field_value in fields.items():
            print(f"  - {field_name}: {field_value}")
    else:
        print(json.dumps(result_new, indent=2))
    
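    # Only report success when both runs actually succeeded
    if result_source.get('status') == 'Succeeded' and result_new.get('status') == 'Succeeded':
        print("\n✅ Both analyzers successfully processed the file using the shared training data!")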


📝 Analyzing with NEW analyzer: cloned-analyzer-c073f24d-5659-42ed-8ac8-b083bde79a9b


INFO:python.content_understanding_client:Analyzing file ../data/receipt.png with analyzer: cloned-analyzer-c073f24d-5659-42ed-8ac8-b083bde79a9b
INFO:python.content_understanding_client:Request 5d982b83-4b1c-4e99-b045-48e36cb5a7e3 in progress ...
INFO:python.content_understanding_client:Request 5d982b83-4b1c-4e99-b045-48e36cb5a7e3 in progress ...
INFO:python.content_understanding_client:Request 5d982b83-4b1c-4e99-b045-48e36cb5a7e3 in progress ...
INFO:python.content_understanding_client:Request 5d982b83-4b1c-4e99-b045-48e36cb5a7e3 in progress ...
INFO:python.content_understanding_client:Request result is ready after 4.72 seconds.



New Analyzer Results:
Extracted 3 field(s)
  - CompanyName: {'type': 'string', 'valueString': 'Contoso'}
  - ProductDetails: {'type': 'array'}
  - TotalPaid: {'type': 'number', 'valueNumber': 2516.28}

✅ Both analyzers successfully processed the file using the shared training data!
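
The printed field dictionaries can be reduced to plain values for easier reading. The sketch below assumes each field stores its value under a key that starts with `value`, as `valueString` and `valueNumber` do in the output above. Other value keys are an assumption, and the helper `field_value` is hypothetical.

```python
from typing import Any


def field_value(field: dict) -> Any:
    """Return the first `value*` entry of a field, or None if there is none."""
    for key, value in field.items():
        if key.startswith("value"):
            return value
    return None


# Field dictionaries copied from the output above.
fields = {
    "CompanyName": {"type": "string", "valueString": "Contoso"},
    "ProductDetails": {"type": "array"},
    "TotalPaid": {"type": "number", "valueNumber": 2516.28},
}
summary = {name: field_value(field) for name, field in fields.items()}
print(summary)
```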


## Step 8: Compare Results (Optional)

Let's compare the full results from both analyzers side by side.

In [None]:
if Path(test_file).exists():
    print("=" * 80)
    print("SOURCE ANALYZER FULL RESULTS")
    print("=" * 80)
    print(json.dumps(result_source, indent=2))
    
    print("\n" + "=" * 80)
    print("NEW ANALYZER FULL RESULTS")
    print("=" * 80)
    print(json.dumps(result_new, indent=2))
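
Reading two full JSON dumps side by side is slow, so a field-level comparison may be a useful complement. This is a minimal sketch that lists the field names whose extracted values differ. The helper `differing_fields` is hypothetical and expects the `fields` dictionaries pulled from each result as in Step 7.

```python
def differing_fields(fields_a: dict, fields_b: dict) -> list:
    """Return the sorted names of fields that are missing on one side or differ."""
    names = set(fields_a) | set(fields_b)
    return sorted(n for n in names if fields_a.get(n) != fields_b.get(n))


# Sample field dictionaries shaped like the Step 7 output.
source_fields = {
    "CompanyName": {"type": "string", "valueString": "Contoso"},
    "TotalPaid": {"type": "number", "valueNumber": 2516.28},
}
new_fields = {
    "CompanyName": {"type": "string", "valueString": "Contoso"},
    "TotalPaid": {"type": "number", "valueNumber": 2500.0},
}
print(differing_fields(source_fields, new_fields))
```

An empty list means both analyzers extracted identical values for every field.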

## Step 9: Cleanup (Optional)

If you want to clean up the test analyzers, you can delete them. In production, you typically keep analyzers for reuse.

⚠️ **Warning**: This will permanently delete the analyzer. The training data in blob storage will remain unaffected.

In [None]:
# Uncomment to delete the new analyzer
# print(f"Deleting new analyzer: {NEW_ANALYZER_ID}")
# client.delete_analyzer(NEW_ANALYZER_ID)
# print("✅ New analyzer deleted")

# Uncomment to also delete the source analyzer (be careful!)
# print(f"Deleting source analyzer: {SOURCE_ANALYZER_ID}")
# client.delete_analyzer(SOURCE_ANALYZER_ID)
# print("✅ Source analyzer deleted")
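
Because deletion is permanent, a small guard can reduce the chance of removing the wrong analyzer. The sketch below wraps any delete callable and skips protected IDs. The helper `delete_if_allowed` is hypothetical, and treating the `prebuilt-` prefix as protected is an assumption based on the IDs listed in Step 1.

```python
from typing import Callable, Iterable


def delete_if_allowed(
    analyzer_id: str, delete: Callable[[str], None], protected: Iterable[str] = ()
) -> bool:
    """Call `delete` unless the ID is built in or explicitly protected."""
    if analyzer_id.startswith("prebuilt-") or analyzer_id in set(protected):
        return False
    delete(analyzer_id)
    return True


# Demo with a stand-in delete function that only records the call.
deleted = []
print(delete_if_allowed("cloned-analyzer-demo", deleted.append, protected={"invoiceLabeledData"}))
print(delete_if_allowed("invoiceLabeledData", deleted.append, protected={"invoiceLabeledData"}))
print(deleted)
```

In the notebook this could be called as `delete_if_allowed(NEW_ANALYZER_ID, client.delete_analyzer, protected={SOURCE_ANALYZER_ID})`.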

## Summary

🎉 **Congratulations!** You have successfully:

✅ Retrieved an existing analyzer with training data  
✅ Extracted the training data configuration  
✅ Created a new analyzer referencing the same training data  
✅ Verified both analyzers work correctly  
✅ Tested both analyzers with a sample file  

### Key Takeaways

- **No data duplication**: Both analyzers reference the same blob storage location
- **Same resource**: Both analyzers use the same authentication and access permissions
- **Field portability**: You can maintain stable `fieldId`s across different analyzer versions
- **Rapid iteration**: Test schema changes quickly without re-uploading training data

### Best Practices

1. **Stable field IDs**: Keep `fieldId`s consistent across analyzers for easier migration
2. **Version control**: Maintain analyzer schemas in source control
3. **Documentation**: Document which blob paths contain which training datasets
4. **Testing**: Always test a new analyzer before deleting the original
5. **Naming conventions**: Use descriptive analyzer IDs that indicate purpose and version
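
A naming convention is easier to follow when it is checked in code. This is a minimal sketch that builds an ID from a purpose and a version, then validates it against the rule quoted in Step 5 (1-64 characters: letters, numbers, dots, underscores, or hyphens). The helper `make_analyzer_id` and the `purpose-vN` format are hypothetical.

```python
import re

# Rule quoted in Step 5: 1-64 characters of letters, numbers, dots, underscores, hyphens.
ANALYZER_ID_PATTERN = re.compile(r"[A-Za-z0-9._-]{1,64}")


def make_analyzer_id(purpose: str, version: int) -> str:
    """Build a descriptive analyzer ID and reject anything the rule would not allow."""
    candidate = f"{purpose}-v{version}"
    if not ANALYZER_ID_PATTERN.fullmatch(candidate):
        raise ValueError(f"Invalid analyzer ID: {candidate!r}")
    return candidate


print(make_analyzer_id("invoice-rebates", 2))
```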

### Next Steps

- Modify the field schema in the new analyzer to test different configurations
- Add additional training data to improve both analyzers
- Use this pattern to create A/B testing scenarios
- Explore other notebooks:
  - [analyzer_training.ipynb](./analyzer_training.ipynb) - Create analyzers with training data
  - [field_extraction.ipynb](./field_extraction.ipynb) - Extract fields from documents
  - [management.ipynb](./management.ipynb) - Manage analyzer lifecycle