## Problem
Agents that need to process file attachments (PDFs, images, audio, video) face several limitations today. Azure Content Understanding (CU) is an Azure AI service that extracts structured content from documents, images, audio, and video using state-of-the-art OCR, transcription, and field extraction. It addresses the following gaps in the Agent Framework:
- Poor OCR / structure extraction — Free digital-text PDF parsers miss scanned content, handwritten text, complex tables, and multi-column layouts. CU provides state-of-the-art OCR with markdown extraction that preserves document structure.
- No multimodal support — Most LLMs don't natively accept audio, video, or rich document formats (DOCX, XLSX, PPTX). Even those that accept images often can't handle multi-page PDFs or long audio. A preprocessing layer is needed to extract structured text from these formats before sending them to the LLM.
- No built-in integration — Developers today must write custom code to call CU, manage analysis state across turns, handle timeouts, and format results for the LLM — this is boilerplate that should be handled by a reusable context provider.
## Proposed solution
A new optional package that integrates Azure Content Understanding into the Agent Framework as a `BaseContextProvider`. It:
- Auto-detects file attachments in `Message.contents`, sends them to CU for analysis, and injects structured results (markdown + extracted fields) into the LLM context
- Supports documents (PDF, DOCX, XLSX, PPTX, HTML), images (JPEG, PNG, TIFF, BMP), audio (WAV, MP3, M4A, FLAC), and video (MP4, MOV, AVI, WebM)
- Works with any LLM client — the extracted markdown/fields are plain text, so any model can consume them
- Provides background processing with a configurable timeout for large files
- Optionally integrates with the `file_search` tool for token-efficient RAG on large documents
- Follows the existing `BaseContextProvider` pattern — zero custom wiring needed
## Implementation plan
This feature will be implemented in both Python and .NET. Python will be delivered first to gather feedback on the API surface and usage patterns, then .NET will follow.
Python PR: #4829
### Code Sample 1 — Multi-turn document Q&A
```python
cu = ContentUnderstandingContextProvider(
    endpoint="https://my-resource.services.ai.azure.com/",
    credential=AzureCliCredential(),
)

async with cu:
    agent = Agent(client=client, name="DocQA", instructions="...", context_providers=[cu])
    session = AgentSession()

    # Turn 1: Upload PDF — CU extracts markdown + fields, injects into LLM context
    response = await agent.run(
        Message(role="user", contents=[
            Content.from_text("What's on this invoice?"),
            Content.from_uri("https://example.com/invoice.pdf", media_type="application/pdf",
                             additional_properties={"filename": "invoice.pdf"}),
        ]),
        session=session,
    )

    # Turn 2: Follow-up — no re-upload, CU results cached in session state
    response = await agent.run("What is the total amount due?", session=session)
```

Complete samples: 01_document_qa.py · 02_multi_turn_session.py
### Code Sample 2 — `file_search` integration for large documents
```python
# CU extracts markdown → auto-uploads to vector store → file_search tool registered
cu = ContentUnderstandingContextProvider(
    endpoint="https://my-resource.services.ai.azure.com/",
    credential=credential,
    file_search=FileSearchConfig.from_foundry(
        openai_client,
        vector_store_id=vector_store.id,
        file_search_tool=client.get_file_search_tool(vector_store_ids=[vector_store.id]),
    ),
)
```

Complete sample: 06_large_doc_file_search.py
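The "token-efficient RAG" routing decision can be illustrated with a small standalone sketch. The `estimate_tokens` heuristic and `should_use_file_search` threshold below are hypothetical, not the package's actual logic: the idea is simply that a small extraction is cheap to inject inline, while a large one is better served from a vector store via `file_search`.

```python
def estimate_tokens(markdown: str) -> int:
    """Crude token estimate for English text (~4 characters per token)."""
    return max(1, len(markdown) // 4)


def should_use_file_search(markdown: str, inline_budget: int = 8_000) -> bool:
    """Route small CU extractions inline; send large ones to the vector store."""
    return estimate_tokens(markdown) > inline_budget


small = "Invoice #123\nTotal: $42.00"     # fits comfortably in the prompt
large = "row\n" * 50_000                  # ~200k chars of extracted table text

print(should_use_file_search(small))  # False → inject markdown directly
print(should_use_file_search(large))  # True  → upload and answer via file_search
```

Keeping this decision inside the provider means callers never choose a path themselves; they attach a file, and the provider trades prompt tokens against retrieval latency on their behalf.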