.NET: [Feature]: Azure Content Understanding context provider for multimodal document analysis #4942

@yungshinlintw

Description

Problem

Agents that need to process file attachments (PDFs, images, audio, video) face several limitations today.

Azure Content Understanding (CU) is an Azure AI service that extracts structured content from documents, images, audio, and video using state-of-the-art OCR, transcription, and field extraction. Integrating it as a context provider addresses the following gaps in the Agent Framework:

  1. Poor OCR / structure extraction — Parsers limited to embedded digital text miss scanned pages, handwritten text, complex tables, and multi-column layouts. CU provides state-of-the-art OCR with markdown extraction that preserves document structure.
  2. No multimodal support — Most LLMs don't natively accept audio, video, or rich document formats (DOCX, XLSX, PPTX). Even those that accept images often can't handle multi-page PDFs or long audio. A preprocessing layer is needed to extract structured text from these formats before sending to the LLM.
  3. No built-in integration — Developers today must write custom code to call CU, manage analysis state across turns, handle timeouts, and format results for the LLM — this is boilerplate that should be handled by a reusable context provider.
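To make gap 3 concrete, the boilerplate a developer writes today includes a hand-rolled polling loop for CU's long-running analysis operation, with timeout handling. The sketch below isolates that loop as a pure function; the terminal state names and injectable `clock`/`sleep` parameters are illustrative assumptions, not the real CU REST contract.

```python
import time

def poll_until_done(get_status, timeout_s=300.0, interval_s=2.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll `get_status()` until it reports a terminal state or `timeout_s` elapses.

    `get_status` stands in for a hypothetical "fetch the CU operation status"
    call; `clock` and `sleep` are injectable so the loop can be tested without
    real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in ("succeeded", "failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError("Content Understanding analysis timed out")
        sleep(interval_s)
```

The proposed provider would own this loop (plus per-turn state and result formatting) so agent code never sees it.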

Proposed solution

A new optional package that integrates Azure Content Understanding into the Agent Framework as a BaseContextProvider. The provider:

  • Auto-detects file attachments in Message.contents, sends them to CU for analysis, and injects structured results (markdown + extracted fields) into the LLM context
  • Supports documents (PDF, DOCX, XLSX, PPTX, HTML), images (JPEG, PNG, TIFF, BMP), audio (WAV, MP3, M4A, FLAC), and video (MP4, MOV, AVI, WebM)
  • Works with any LLM client — the extracted markdown/fields are plain text, so any model can consume them
  • Provides background processing with configurable timeout for large files
  • Optionally integrates with file_search tool for token-efficient RAG on large documents
  • Follows the existing BaseContextProvider pattern — zero custom wiring needed
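The auto-detection step in the first bullet can be sketched as a simple media-type filter over message contents. The dict shape (`"media_type"` key) and the exact media-type set are assumptions for illustration; the real provider would inspect the framework's Content objects.

```python
# Hypothetical media types the provider would route to CU, mirroring the
# formats listed above (documents, images, audio, video). Not exhaustive.
SUPPORTED_MEDIA_TYPES = {
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",  # DOCX
    "image/jpeg", "image/png", "image/tiff", "image/bmp",
    "audio/wav", "audio/mpeg", "audio/flac",
    "video/mp4", "video/quicktime", "video/webm",
}

def detect_attachments(contents):
    """Return the subset of message contents CU should analyze.

    `contents` is assumed to be a list of dicts; items without a supported
    media type (e.g. plain text) pass through to the LLM untouched.
    """
    return [c for c in contents if c.get("media_type") in SUPPORTED_MEDIA_TYPES]
```

Anything not matched stays in the message as-is, so the provider is a no-op for text-only turns.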

Implementation plan

This feature will be implemented in both Python and .NET. Python will be delivered first to gather feedback on the API surface and usage patterns, then .NET will follow.

Python PR: #4829

Code Sample 1 — Multi-turn document Q&A

```python
cu = ContentUnderstandingContextProvider(
    endpoint="https://my-resource.services.ai.azure.com/",
    credential=AzureCliCredential(),
)

async with cu:
    agent = Agent(client=client, name="DocQA", instructions="...", context_providers=[cu])
    session = AgentSession()

    # Turn 1: Upload PDF — CU extracts markdown + fields, injects into LLM context
    response = await agent.run(
        Message(role="user", contents=[
            Content.from_text("What's on this invoice?"),
            Content.from_uri("https://example.com/invoice.pdf", media_type="application/pdf",
                             additional_properties={"filename": "invoice.pdf"}),
        ]),
        session=session,
    )

    # Turn 2: Follow-up — no re-upload, CU results cached in session state
    response = await agent.run("What is the total amount due?", session=session)
```

Complete samples: 01_document_qa.py · 02_multi_turn_session.py
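The "no re-upload" behavior in turn 2 implies a per-session cache keyed by file identity. A minimal sketch of that idea, assuming a URI-keyed dict (the actual provider's session-state schema is not specified here):

```python
class AnalysisCache:
    """Caches CU analysis results per file URI so repeat turns skip re-analysis."""

    def __init__(self):
        self._results = {}  # file URI -> extracted markdown/fields

    def get_or_analyze(self, uri, analyze):
        # `analyze` stands in for the (expensive) CU call; it runs only the
        # first time a given URI is seen in this session.
        if uri not in self._results:
            self._results[uri] = analyze(uri)
        return self._results[uri]
```

On turn 2 the cached markdown is injected into context directly, so the follow-up question costs no additional CU calls.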

Code Sample 2 — file_search integration for large documents

```python
# CU extracts markdown → auto-uploads to vector store → file_search tool registered
cu = ContentUnderstandingContextProvider(
    endpoint="https://my-resource.services.ai.azure.com/",
    credential=credential,
    file_search=FileSearchConfig.from_foundry(
        openai_client,
        vector_store_id=vector_store.id,
        file_search_tool=client.get_file_search_tool(vector_store_ids=[vector_store.id]),
    ),
)
```

Complete sample: 06_large_doc_file_search.py
