# Extract Custom Fields from Your File

This notebook demonstrates how to use analyzers to extract custom fields from your input files.

## Analyzer Templates

The dictionary below maps each template name to an analyzer template file and a sample input file to analyze with it.

You can customize each template to suit your needs. For additional verified templates from Microsoft, see the [analyzer templates README](../analyzer_templates/README.md).

In [30]:
extraction_templates = {
    # template name: (analyzer template path, sample file path)
    "contrato": ("../analyzer_templates/contrato.json", "../data/contrato.pdf"),
}

Specify the analyzer template you want to use and provide a name for the analyzer to be created based on the template.

In [31]:
import uuid


ANALYZER_TEMPLATE = "contrato"
ANALYZER_ID = "field-extraction-sample-" + str(uuid.uuid4())

(analyzer_template_path, analyzer_sample_file_path) = extraction_templates[ANALYZER_TEMPLATE]
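
Before creating an analyzer, it can help to check which fields the chosen template defines. The sketch below assumes the template JSON uses the `fieldSchema.fields` layout that appears in the analyzer output later in this notebook. The `list_template_fields` helper and the inline sample are illustrative and not part of the sample repository.

```python
import json


def list_template_fields(template: dict) -> list[tuple[str, str, str]]:
    """Return (name, type, method) for each field in an analyzer template."""
    fields = template.get("fieldSchema", {}).get("fields", {})
    return [
        (name, spec.get("type", ""), spec.get("method", ""))
        for name, spec in fields.items()
    ]


# Inline stand-in for a template file. In the notebook you would instead
# open analyzer_template_path with encoding="utf-8" and call json.load on it.
sample_template = json.loads("""
{
  "fieldSchema": {
    "fields": {
      "DataAssinatura": {"type": "string", "method": "extract"},
      "NomeCompletoVendedor": {"type": "string", "method": "extract"}
    }
  }
}
""")

for name, field_type, method in list_template_fields(sample_template):
    print(f"{name}: type={field_type}, method={method}")
```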

## Create Azure AI Content Understanding Client

> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is a utility class containing functions to interact with the Content Understanding API. Until the official Content Understanding SDK is released, it can serve as a lightweight SDK.


In [32]:
import logging
import json
import os
import sys
from pathlib import Path
from dotenv import find_dotenv, load_dotenv
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
import pandas as pd

load_dotenv(find_dotenv())
logging.basicConfig(level=logging.INFO)

AZURE_AI_ENDPOINT = os.getenv("AZURE_AI_ENDPOINT")
AZURE_AI_API_VERSION = os.getenv("AZURE_AI_API_VERSION", "2024-12-01-preview")

# Add the parent directory to sys.path so the shared `src` package can be imported
parent_dir = Path.cwd().parent
sys.path.append(str(parent_dir))
from src.content_understanding_client import AzureContentUnderstandingClient

credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")

client = AzureContentUnderstandingClient(
    endpoint=AZURE_AI_ENDPOINT,
    api_version=AZURE_AI_API_VERSION,
    token_provider=token_provider,
)

INFO:azure.identity._credentials.environment:No environment configuration found.
INFO:azure.identity._credentials.managed_identity:ManagedIdentityCredential will use IMDS
INFO:azure.core.pipeline.policies.http_logging_policy:Request URL: 'http://169.254.169.254/metadata/identity/oauth2/token?api-version=REDACTED&resource=REDACTED'
Request method: 'GET'
Request headers:
    'User-Agent': 'azsdk-python-identity/1.23.0 Python/3.11.0rc2 (Windows-10-10.0.26100-SP0)'
No body was attached to the request
INFO:azure.identity._credentials.chained:DefaultAzureCredential acquired a token from AzureCliCredential
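
`os.getenv` returns `None` when a variable is missing, so a missing `AZURE_AI_ENDPOINT` may only surface later as a failed request. A minimal sketch of failing early follows. `require_env` is a hypothetical helper, not part of the client, and the placeholder URL exists only so the example runs on its own.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return value


# Placeholder value so the sketch is runnable outside the notebook.
os.environ.setdefault("AZURE_AI_ENDPOINT", "https://example.invalid/")
endpoint = require_env("AZURE_AI_ENDPOINT")
print(endpoint)
```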


## Create Analyzer from the Template

In [33]:
response = client.begin_create_analyzer(ANALYZER_ID, analyzer_template_path=analyzer_template_path)

result = client.poll_result(response)

print(json.dumps(result, indent=2))

INFO:src.content_understanding_client:Analyzer field-extraction-sample-d8417119-2c9a-46b1-8eb9-2369af2bace7 create request accepted.
INFO:src.content_understanding_client:Request 4d64282c-7ef6-4c5f-b1bf-1e3bb9f17074 in progress ...
INFO:src.content_understanding_client:Request result is ready after 3.00 seconds.


{
  "id": "4d64282c-7ef6-4c5f-b1bf-1e3bb9f17074",
  "status": "Succeeded",
  "result": {
    "analyzerId": "field-extraction-sample-d8417119-2c9a-46b1-8eb9-2369af2bace7",
    "description": "An\u00e1lise de contrato de financiamento habitacional",
    "createdAt": "2025-05-26T23:03:21Z",
    "lastModifiedAt": "2025-05-26T23:03:25Z",
    "config": {
      "returnDetails": true,
      "enableOcr": true,
      "enableLayout": true,
      "enableBarcode": false,
      "enableFormula": false,
      "disableContentFiltering": false
    },
    "fieldSchema": {
      "fields": {
        "DataAssinatura": {
          "type": "string",
          "method": "extract",
          "description": "Data de assinatura do contrato de financiamento habitacional. Base legal: subitem 9.1.6 do RAFCVS."
        },
        "NomeCompletoVendedor": {
          "type": "string",
          "method": "extract",
          "description": "Nome completo do(s) vendedor(es) do im\u00c3\u00b3vel. Subitem 11.5.2 do 

## Extract Fields Using the Analyzer

After the analyzer is successfully created, we can use it to analyze our input files.

In [34]:
response = client.begin_analyze(ANALYZER_ID, file_location=analyzer_sample_file_path)
result = client.poll_result(response)

INFO:src.content_understanding_client:Analyzing file ../data/contrato.pdf with analyzer: field-extraction-sample-d8417119-2c9a-46b1-8eb9-2369af2bace7
INFO:src.content_understanding_client:Request 3677de0f-34f1-42b0-ba96-34e2baee575b in progress ...
INFO:src.content_understanding_client:Request 3677de0f-34f1-42b0-ba96-34e2baee575b in progress ...
INFO:src.content_understanding_client:Request 3677de0f-34f1-42b0-ba96-34e2baee575b in progress ...
INFO:src.content_understanding_client:Request 3677de0f-34f1-42b0-ba96-34e2baee575b in progress ...
INFO:src.content_understanding_client:Request 3677de0f-34f1-42b0-ba96-34e2baee575b in progress ...
INFO:src.content_understanding_client:Request 3677de0f-34f1-42b0-ba96-34e2baee575b in progress ...
INFO:src.content_understanding_client:Request result is ready after 16.92 seconds.
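
The repeated `in progress` log lines come from polling a long-running operation until it finishes. The internals of `poll_result` are not shown in this notebook, so the sketch below only illustrates the general pattern. The names `poll_until_done` and `get_status`, the default timings, and the `Running` and `Failed` status strings are assumptions.

```python
import time
from typing import Callable


def poll_until_done(
    get_status: Callable[[], dict],
    timeout_seconds: float = 120.0,
    interval_seconds: float = 2.0,
) -> dict:
    """Call get_status until it reports a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = get_status()
        status = str(response.get("status", "")).lower()
        if status == "succeeded":
            return response
        if status == "failed":
            raise RuntimeError(f"Operation failed: {response}")
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Operation still {status!r} after {timeout_seconds} seconds"
            )
        time.sleep(interval_seconds)


# Fake operation that succeeds on the third poll, standing in for an HTTP GET.
replies = iter([{"status": "Running"}, {"status": "Running"}, {"status": "Succeeded"}])
final = poll_until_done(lambda: next(replies), timeout_seconds=5, interval_seconds=0.01)
print(final["status"])
```

Using a monotonic clock for the deadline keeps the timeout unaffected by system clock adjustments.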


In [37]:
# helper function to extract from nested JSON
def extract_fields_from_value_array(value_array):
    """
    Extracts fields from a valueArray, returning a dictionary:
    {
        field_name: {
            "value": ...,
            "confidence": ...,
            "source": ...
        },
        ...
    }
    """
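    # Typed value keys to check, in priority order
    value_keys = (
        "valueString", "valueNumber", "valueDate",
        "valueBoolean", "valueArray", "valueObject",
    )
    result = {}
    for item in value_array:
        value_object = item.get("valueObject", {})
        # Note: items that share field names overwrite earlier entries
        for key, field in value_object.items():
            # Take the first typed value that is present. Comparing against
            # None (instead of chaining `or`) keeps falsy values such as
            # 0, False, or "".
            value = next(
                (field[k] for k in value_keys if field.get(k) is not None),
                None,
            )
            result[key] = {
                "value": value,
                "confidence": field.get("confidence"),
                "source": field.get("source"),
            }
    return result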


In [41]:
data = []
scalar_keys = ("valueString", "valueDate", "valueNumber")

for field_name, value in result["result"]["contents"][0]["fields"].items():
    if value.get("type") != "array":
        # Take the first typed value that is present; comparing against None
        # keeps legitimate falsy values such as 0.
        field_value = next(
            (value[k] for k in scalar_keys if value.get(k) is not None), None
        )
        if isinstance(field_value, str):
            # Normalize strings: drop commas and periods, then lowercase
            field_value = field_value.replace(",", "").replace(".", "").lower()
        confidence = value.get("confidence")
        source = value.get("source", {})
        data.append([field_name, field_value, confidence, source])
    else:
        # Array fields: flatten the nested valueArray and store it as a JSON string
        value_array = value.get("valueArray", [])
        extracted_fields = extract_fields_from_value_array(value_array)
        data.append([field_name, json.dumps(extracted_fields), None, None])

df = pd.DataFrame(data, columns=["Field", "Value", "Confidence", "Source"])
# Make sure the output directory exists before writing the CSV
Path("../results").mkdir(parents=True, exist_ok=True)
df.to_csv("../results/extracted_fields.csv", index=False)
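
Each row carries the confidence score reported for its field, which makes it straightforward to flag values for manual review. The sketch below works on rows shaped like `data` above. The 0.8 threshold, the `low_confidence_rows` name, the `ListaExemplo` field, and all sample values are illustrative assumptions.

```python
def low_confidence_rows(rows, threshold=0.8):
    """Return rows whose confidence is known and below the threshold.

    Rows with confidence None (for example the flattened array fields)
    are skipped because they carry no single score.
    """
    return [row for row in rows if row[2] is not None and row[2] < threshold]


# Rows follow the [field, value, confidence, source] layout; values are made up.
sample_rows = [
    ["DataAssinatura", "01/01/2020", 0.95, None],
    ["NomeCompletoVendedor", "nome de exemplo", 0.42, None],
    ["ListaExemplo", "{}", None, None],
]

for field, value, confidence, _source in low_confidence_rows(sample_rows):
    print(f"Review {field}: {value!r} (confidence {confidence})")
```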

## Clean Up
Optionally, delete the sample analyzer from your resource. In typical usage, you would keep the analyzer and reuse it to analyze multiple files.

In [None]:
client.delete_analyzer(ANALYZER_ID)
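
When an analyzer is only needed for a short experiment, wrapping creation and deletion in a context manager ensures the cleanup runs even if analysis raises. This sketch calls only the client methods used in this notebook. `temporary_analyzer` and `StubClient` are hypothetical and exist only so the example runs without contacting the service.

```python
from contextlib import contextmanager


@contextmanager
def temporary_analyzer(client, analyzer_id, template_path):
    """Create an analyzer, yield its ID, and delete it on exit."""
    response = client.begin_create_analyzer(
        analyzer_id, analyzer_template_path=template_path
    )
    client.poll_result(response)
    try:
        yield analyzer_id
    finally:
        # Runs on normal exit and when the with-body raises
        client.delete_analyzer(analyzer_id)


class StubClient:
    """Stand-in that records calls instead of contacting the service."""

    def __init__(self):
        self.calls = []

    def begin_create_analyzer(self, analyzer_id, analyzer_template_path):
        self.calls.append(("create", analyzer_id))
        return {"id": "stub-operation"}

    def poll_result(self, response):
        self.calls.append(("poll", response["id"]))
        return {"status": "Succeeded"}

    def delete_analyzer(self, analyzer_id):
        self.calls.append(("delete", analyzer_id))


stub = StubClient()
with temporary_analyzer(stub, "demo-analyzer", "template.json") as analyzer_id:
    print(f"Analyzing with {analyzer_id}")
print(stub.calls)
```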