# Context Optimization with MCP and Code Execution

----

This notebook explores **context optimization strategies** for AI agents. We compare **anti-patterns** (inefficient approaches) with **best practices** to demonstrate how to build **more efficient and scalable AI agents**.

You will learn:

- **Part 1: Warming Up** - MCP fundamentals and understanding your model's context window
- **Part 2: Context Optimization** - Anti-patterns vs Best practices across four key areas:
  1. Single Code Execution (full data vs sandbox)
  2. Multi-Step Workflows (multiple trips vs single pipeline)
  3. MCP Tool Management (schema bloat vs tool filtering)
  4. Context Compression (growing history vs compressed context)

## Table of Contents

- [What is MCP?](#what-is-mcp)
- [The Problem: Context Window Overload](#the-problem-context-window-overload)
- [Setup](#setup)
- [Part 1: Warming Up](#part-1-warming-up)
  - 1.1 MCPStdioTool - Local MCP Server
  - 1.2 MCPStreamableHTTPTool - Remote HTTP Server
  - 1.3 Understanding Context Window Limits
- [Part 2: Context Optimization](#part-2-context-optimization)
  - **Section 1: Single Code Execution**
    - 1.1 [Anti-Pattern] Full Data in Context
    - 1.2 [Best Practice] Data Generation in Sandbox
  - **Section 2: Multi-Step Code Execution**
    - 2.1 [Anti-Pattern] Complex Workflows (4 Round Trips)
    - 2.2 [Best Practice] Multi-Step Pipeline in Single Execution
  - **Section 3: MCP Optimization**
    - 3.1 [Anti-Pattern] Remote MCP Schema Bloat
    - 3.2 [Best Practice] Tool Filtering + History Limit
  - **Section 4: Context Compression**
    - 4.1 [Anti-Pattern] Inappropriate Large Context
    - 4.2 [Best Practice] Context Compression
- [Best Practices Summary](#best-practices-summary)
- [Wrap-up](#wrap-up)

## What is MCP?

### Model Context Protocol (MCP)

The **Model Context Protocol (MCP)** is an open standard that enables AI agents to seamlessly connect to external systems, tools, and data sources. Think of it as a **universal adapter** for AI agents.

```
MCP = Standardized Protocol for AI ↔ External Systems
```

### What MCP Connects To

| Category | Examples |
|----------|----------|
| Cloud Services | Google Drive, Salesforce, AWS, Azure |
| Databases | PostgreSQL, MongoDB, Redis, SQLite |
| APIs | REST APIs, GraphQL endpoints |
| File Systems | Local and remote file storage |
| Development Tools | GitHub, Jira, Slack |

### MCP Tool Types in Microsoft Agent Framework

| Tool Type | Connection Method | Use Case |
|-----------|------------------|----------|
| `MCPStdioTool` | Standard I/O (local process) | Local MCP servers (e.g., calculator, filesystem) |
| `MCPStreamableHTTPTool` | HTTP with SSE | Remote HTTP-based MCP servers |
| `MCPWebsocketTool` | WebSocket | Real-time bidirectional communication |

## The Problem: Context Window Overload

### Challenge 1: Tool Definition Bloat

When an AI agent connects to multiple tools via MCP, **all tool definitions must be loaded** into the model's context window.

```
Agent Context Window (128K tokens):
├── Tool Definitions: 25,000 tokens (50 tools × 500 tokens avg)
├── Conversation History: 15,000 tokens
├── System Instructions: 5,000 tokens
└── Available for Work: 83,000 tokens (only 65% of capacity)
```
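
The arithmetic behind this budget can be checked in a few lines of Python. This is a minimal sketch using the illustrative figures from the diagram above; the variable names are ours and not part of any framework.

```python
# Illustrative budget from the diagram above (all figures are example values).
CONTEXT_WINDOW = 128_000
tool_definitions = 50 * 500        # 50 tools at ~500 tokens each
conversation_history = 15_000
system_instructions = 5_000

overhead = tool_definitions + conversation_history + system_instructions
available = CONTEXT_WINDOW - overhead
available_pct = available / CONTEXT_WINDOW * 100

print(f"Overhead:  {overhead:,} tokens")
print(f"Available: {available:,} tokens ({available_pct:.0f}% of capacity)")
```

Changing the number of tools or the average schema size shows how quickly tool definitions alone can eat into the window.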

### Challenge 2: Excessive Token Consumption

Every intermediate result from tool calls must pass through the model's context window.

**Example: Processing sales data**

| Approach | Steps | Tokens |
|----------|-------|--------|
| Traditional | Download spreadsheet → 50K, Process → 30K, Update → 5K | **85,000 tokens** |
| Code Execution | Execute filtered query → 5K (only filtered results returned) | **5,000 tokens** |

**Result: 94% token reduction with code execution approach**
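
The 94% figure follows directly from the table. A small sketch of the calculation, with a hypothetical helper name (`token_reduction_pct`) chosen for illustration:

```python
def token_reduction_pct(baseline_tokens: int, optimized_tokens: int) -> float:
    """Percentage of tokens saved relative to the baseline."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return (baseline_tokens - optimized_tokens) / baseline_tokens * 100

# Figures from the table above.
traditional = 50_000 + 30_000 + 5_000
code_execution = 5_000

print(f"{token_reduction_pct(traditional, code_execution):.0f}% reduction")
```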

### The Solution: Code Execution with MCP

Code execution allows agents to:
1. **Execute code** to process data locally (in sandbox)
2. **Filter and transform** results before returning
3. **Maintain state** across operations
4. **Build reusable functions** and workflows
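
The first two points can be sketched in plain Python. The example below assumes hypothetical records standing in for the result of a tool call; it filters and reduces them locally, then compares the size of the raw payload with the size of the summary that would actually be returned to the model.

```python
import json

# Hypothetical raw records, standing in for data fetched by a tool call.
records = [
    {"id": i, "region": "North" if i % 2 == 0 else "South", "amount": 100 + i}
    for i in range(1_000)
]

def summarize(rows: list[dict], region: str) -> dict:
    """Filter to one region and reduce to a few aggregate numbers."""
    amounts = [r["amount"] for r in rows if r["region"] == region]
    return {
        "region": region,
        "count": len(amounts),
        "total": sum(amounts),
        "max": max(amounts) if amounts else None,
    }

summary = summarize(records, "North")
raw_chars = len(json.dumps(records))
summary_chars = len(json.dumps(summary))

print(summary)
print(f"Raw payload: {raw_chars:,} chars; summary: {summary_chars:,} chars")
```

Because `summarize` is an ordinary function, it can be reused across later steps of the same session instead of being re-derived by the model each time.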

## Setup

This notebook reuses the configuration file (`.foundry_config.json`) created by `0_setup/1_setup.ipynb`.

- If the file is missing, run the setup notebook first.
- Make sure you can authenticate (e.g., `az login`), so `DefaultAzureCredential` can work.

In [None]:
# Environment setup and PATH configuration
import json
import os
import subprocess
import asyncio
from datetime import datetime
from dotenv import load_dotenv

load_dotenv(override=True)

# Ensure the notebook kernel can find Azure CLI (`az`) on PATH
possible_paths = [
    '/opt/homebrew/bin',   # macOS (Apple Silicon)
    '/usr/local/bin',      # macOS (Intel) / Linux
    '/usr/bin',            # Linux / Codespaces
    '/home/linuxbrew/.linuxbrew/bin',  # Linux Homebrew
]

az_path = None
try:
    result = subprocess.run(['which', 'az'], capture_output=True, text=True)
    if result.returncode == 0:
        az_path = os.path.dirname(result.stdout.strip())
        print(f'üîç Azure CLI found: {result.stdout.strip()}')
except Exception:
    pass

paths_to_add: list[str] = []
if az_path and az_path not in os.environ.get('PATH', ''):
    paths_to_add.append(az_path)
else:
    for path in possible_paths:
        if os.path.exists(path) and path not in os.environ.get('PATH', ''):
            paths_to_add.append(path)

if paths_to_add:
    os.environ['PATH'] = ':'.join(paths_to_add) + ':' + os.environ.get('PATH', '')
    print(f"✅ Added to PATH: {', '.join(paths_to_add)}")
else:
    print('✅ PATH looks good already')

print(f"\nPATH (first 150 chars): {os.environ['PATH'][:150]}...")

In [None]:
# Load Foundry project settings from .foundry_config.json
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential

config_file = '../0_setup/.foundry_config.json'
try:
    with open(config_file, 'r', encoding='utf-8') as f:
        config = json.load(f)
except FileNotFoundError as e:
    print(f"⚠️ Could not find '{config_file}'.")
    print('💡 Run 0_setup/1_setup.ipynb first to create it.')
    raise e

# project variables from config
FOUNDRY_NAME = config.get('FOUNDRY_NAME')
RESOURCE_GROUP = config.get('RESOURCE_GROUP')
LOCATION = config.get('LOCATION')
PROJECT_NAME = config.get('PROJECT_NAME', 'proj-default')
AZURE_AI_PROJECT_ENDPOINT = config.get('AZURE_AI_PROJECT_ENDPOINT')
AZURE_AI_MODEL_DEPLOYMENT_NAME = config.get('AZURE_AI_MODEL_DEPLOYMENT_NAME')

# Azure OpenAI variables from env
AZURE_OPENAI_ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_API_KEY = os.environ.get("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_CHAT_DEPLOYMENT_NAME = os.environ.get("AZURE_OPENAI_CHAT_DEPLOYMENT_NAME")
AZURE_OPENAI_API_VERSION = os.environ.get("AZURE_OPENAI_API_VERSION")

os.environ['FOUNDRY_NAME'] = FOUNDRY_NAME or ''
os.environ['LOCATION'] = LOCATION or ''
os.environ['RESOURCE_GROUP'] = RESOURCE_GROUP or ''
os.environ['AZURE_SUBSCRIPTION_ID'] = config.get('AZURE_SUBSCRIPTION_ID', '')

print(f"✅ Loaded settings from '{config_file}'.")
print(f"\n📌 Foundry name: {FOUNDRY_NAME}")
print(f"📌 Resource group: {RESOURCE_GROUP}")
print(f"📌 Location: {LOCATION}")
print(f"📌 Azure AI project endpoint: {AZURE_AI_PROJECT_ENDPOINT}")
print(f"📌 Azure OpenAI endpoint: {AZURE_OPENAI_ENDPOINT}")
print(f"📌 Model deployment: {AZURE_OPENAI_CHAT_DEPLOYMENT_NAME}")
print(f"📌 Azure OpenAI API version: {AZURE_OPENAI_API_VERSION}")
# Initialize credential for Azure services
credential = DefaultAzureCredential()

## Part 1: Warming Up

Before diving into context optimization, let's get familiar with **MCP (Model Context Protocol)** basics and understand how to connect to MCP servers using the Microsoft Agent Framework.

### MCP Tool Types in Microsoft Agent Framework

The Microsoft Agent Framework supports three connection types for MCP:

| Tool Class | Transport | Best For |
|------------|-----------|----------|
| `MCPStdioTool` | Standard I/O | Local process-based servers |
| `MCPStreamableHTTPTool` | HTTP + SSE | Remote HTTP endpoints |
| `MCPWebsocketTool` | WebSocket | Bidirectional real-time comm |

### 1.1 MCPStdioTool - Local MCP Server (Calculator)

`MCPStdioTool` connects to MCP servers that run as **local processes** using standard input/output. This is the simplest way to use MCP tools.

**Popular Local MCP Servers:**
- `uvx mcp-server-calculator` - Mathematical computations
- `uvx mcp-server-filesystem` - File system operations
- `uvx mcp-server-sqlite` - Database operations
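
Under the stdio transport, client and server exchange newline-delimited JSON-RPC 2.0 messages over the server process's stdin and stdout. The sketch below builds the kind of `tools/call` request a client might send. The tool name `calculate` and its `expression` argument are assumptions for illustration, and `build_tool_call` is a hypothetical helper; `MCPStdioTool` handles this framing for you.

```python
import json

def build_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 `tools/call` request as one line of text."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    # One message per line: json.dumps emits no raw newlines by default.
    return json.dumps(message) + "\n"

# Tool name and argument are illustrative assumptions.
line = build_tool_call(1, "calculate", {"expression": "15 * 23 + 45"})
print(line, end="")
```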

In [3]:
# Example 1.1: Basic MCP Tool Call with MCPStdioTool
# This demonstrates the traditional approach where tool definitions are loaded and called

import sys
import os
import shutil

from agent_framework import MCPStdioTool
from agent_framework.azure import AzureAIClient

# Portable: Auto-detect MCP server path from current venv (works on any machine after `uv sync`)
def get_mcp_server_path(server_name: str) -> str:
    """
    Get the path to an MCP server executable in the current virtual environment.
    
    This is the recommended portable approach:
    1. First, try to find it in the same directory as the Python interpreter (venv/bin)
    2. Fallback to shutil.which() to search PATH
    
    Args:
        server_name: Name of the MCP server executable (e.g., 'mcp-server-calculator')
    
    Returns:
        Absolute path to the MCP server executable
    
    Raises:
        FileNotFoundError: If the server is not found
    """
    # Method 1: Look in the venv bin directory (most reliable)
    venv_bin_dir = os.path.dirname(sys.executable)
    venv_path = os.path.join(venv_bin_dir, server_name)
    
    if os.path.isfile(venv_path):
        return venv_path
    
    # Method 2: Fallback to PATH search
    path_result = shutil.which(server_name)
    if path_result:
        return path_result
    
    raise FileNotFoundError(
        f"❌ '{server_name}' not found.\n"
        f"   Checked: {venv_path}\n"
        f"   Run: uv sync  (to install dependencies from pyproject.toml)"
    )

# Get the calculator MCP server path
MCP_CALCULATOR_PATH = get_mcp_server_path("mcp-server-calculator")
print(f"📍 MCP Calculator found at: {MCP_CALCULATOR_PATH}")

async def basic_mcp_calculator_example():
    """
    Traditional MCP approach: Load calculator tool and perform computation.
    
    In this approach:
    - All tool definitions are loaded into context
    - Each tool call returns full results to the model
    - Token usage scales with data size
    """
    print("=" * 60)
    print("🧮 Traditional MCP Tool Call: Calculator Example")
    print("=" * 60)
    
    async with (
        MCPStdioTool(
            name="calculator",
            command=MCP_CALCULATOR_PATH,
            args=[]
        ) as mcp_calculator,
        AzureAIClient(credential=credential, project_endpoint=AZURE_AI_PROJECT_ENDPOINT).create_agent(
            name="MathAgent",
            instructions="You are a helpful math assistant that can solve calculations.",
            tools=mcp_calculator,
        ) as agent,
    ):
        # Simple calculation using MCP tool
        user_query = "What is 15 * 23 + 45?"
        print(f"\n📝 User Query: {user_query}")
        
        result = await agent.run(user_query)
        
        print(f"\n✅ Agent Response: {result}")
        return result

# Run the example
await basic_mcp_calculator_example()

📍 MCP Calculator found at: /afh/code/agent-operator-lab/.venv/bin/mcp-server-calculator
🧮 Traditional MCP Tool Call: Calculator Example

📝 User Query: What is 15 * 23 + 45?

✅ Agent Response: 390


<agent_framework._types.AgentRunResponse at 0x78391ffafe00>

### 1.2 MCPStreamableHTTPTool - Remote HTTP MCP Server

`MCPStreamableHTTPTool` connects to MCP servers over **Streamable HTTP**, MCP's HTTP transport, which can optionally stream responses with Server-Sent Events (SSE). This is useful for connecting to remote services such as the Microsoft Learn documentation server.

In [4]:
# Example 1.2: HTTP-based MCP Server with MCPStreamableHTTPTool
from agent_framework import MCPStreamableHTTPTool
from agent_framework.azure import AzureAIClient

async def http_mcp_docs_example():
    """
    Connect to an HTTP-based MCP server (e.g., Microsoft Learn API).
    
    This demonstrates:
    - Remote MCP server connection via HTTP
    - SSE (Server-Sent Events) for streaming responses
    - Authentication header configuration
    """
    print("=" * 60)
    print("📚 HTTP MCP Tool: Microsoft Learn Documentation")
    print("=" * 60)
    
    async with (
        MCPStreamableHTTPTool(
            name="Microsoft Learn MCP",
            url="https://learn.microsoft.com/api/mcp",
            # headers={"Authorization": "Bearer <your-token>"},  # Uncomment if auth required
        ) as mcp_docs,
        # Use AzureAIClient pattern (credential from setup cell)
        AzureAIClient(credential=credential, project_endpoint=AZURE_AI_PROJECT_ENDPOINT).create_agent(
            name="DocsAgent",
            instructions="You help with Microsoft documentation questions.",
            tools=mcp_docs,
        ) as agent,
    ):
        user_query = "How to create an Azure storage account using az cli?"
        print(f"\n📝 User Query: {user_query}")
        
        result = await agent.run(user_query)
        
        print(f"\n✅ Agent Response: {result}")
        return result

# Run the example
await http_mcp_docs_example()

📚 HTTP MCP Tool: Microsoft Learn Documentation

📝 User Query: How to create an Azure storage account using az cli?

✅ Agent Response: Use these Azure CLI commands:

1) Sign in (if needed):
```bash
az login
```

2) Create a resource group:
```bash
az group create \
  --name storage-resource-group \
  --location eastus
```

3) Create the storage account (name must be globally unique, 3–24 lowercase letters/numbers):
```bash
az storage account create \
  --name <account-name> \
  --resource-group storage-resource-group \
  --location eastus \
  --sku Standard_RAGRS \
  --kind StorageV2 \
  --min-tls-version TLS1_2 \
  --allow-blob-public-access false
```

Reference: https://learn.microsoft.com/en-us/azure/storage/common/storage-account-create#create-a-storage-account


<agent_framework._types.AgentRunResponse at 0x78391f05f390>

### 1.3 Understanding Context Window Limits

Before optimizing context usage, it's essential to understand the **context window limits** of your model. Different models have different capacities, and knowing your limits helps you plan accordingly.

| Model | Context Window | Approximate Characters |
|-------|---------------|------------------------|
| GPT-4o | 128K tokens | ~512,000 chars |
| GPT-4o-mini | 128K tokens | ~512,000 chars |
| GPT-4.1 | 1M tokens | ~4,000,000 chars |
| Claude 3.5 | 200K tokens | ~800,000 chars |

**Why Context Window Matters:**
- Tool schemas consume tokens on **every API call**
- Conversation history grows with each turn
- Large data payloads can quickly exhaust the window
- Leaving room for model output is essential
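
The "Approximate Characters" column above assumes roughly 4 characters per token. A minimal sketch of a pre-flight check built on that heuristic follows; `approx_tokens` and `fits_in_context` are hypothetical helper names, and the real ratio varies with the content and the tokenizer, so treat the result as a rough guide only.

```python
def approx_tokens(char_count: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from a character count (heuristic only)."""
    return int(char_count / chars_per_token)

def fits_in_context(payload_chars: int, context_window: int,
                    reserved_tokens: int = 0) -> bool:
    """True if the estimated payload plus reserved tokens fits in the window."""
    return approx_tokens(payload_chars) + reserved_tokens <= context_window

# 512,000 characters is about 128K tokens under the 4 chars/token assumption.
print(approx_tokens(512_000))

# A 600,000-character payload with 4,000 tokens reserved for output
# does not fit in a 128K window.
print(fits_in_context(600_000, 128_000, reserved_tokens=4_000))
```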

In [None]:
# 1.3: Check your model's context window and current usage
# This helps you understand how much capacity you have for optimization

import tiktoken
from openai import AzureOpenAI

def estimate_tokens(text: str, model: str = "gpt-4.1") -> int:
    """Estimate token count for a given text using tiktoken."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

# Known context window sizes for common models (as of 2025)
MODEL_CONTEXT_WINDOWS = {
    # GPT-4 series
    "gpt-4": 8192,
    "gpt-4-32k": 32768,
    "gpt-4-turbo": 128000,
    "gpt-4-turbo-preview": 128000,
    "gpt-4o": 128000,
    "gpt-4o-mini": 128000,
    "gpt-4.1": 1000000,  # 1M tokens (STU)
    "gpt-4.1-mini": 1000000,
    "gpt-4.1-nano": 1000000,
    # GPT-5 series (400K total context: input plus output)
    "gpt-5": 400000,
    "gpt-5.1": 400000,
    # GPT-3.5 series
    "gpt-35-turbo": 4096,
    "gpt-35-turbo-16k": 16384,
    "gpt-3.5-turbo": 4096,
    "gpt-3.5-turbo-16k": 16384,
    # o-series reasoning models
    "o1": 200000,
    "o1-mini": 128000,
    "o1-preview": 128000,
    "o3": 200000,
    "o3-mini": 200000,
}

def get_model_context_window(deployment_name: str, client: AzureOpenAI = None) -> tuple[int, str]:
    """
    Get the context window size for a model deployment.
    
    Tries multiple methods:
    1. Query Azure OpenAI API for model info
    2. Fall back to known model context windows lookup
    
    Returns:
        tuple: (context_window_size, source_method)
    """
    # Method 1: Try to get model info from Azure OpenAI API
    if client:
        try:
            # List models and find the one matching our deployment
            models = client.models.list()
            for model in models.data:
                # Check if this model matches our deployment
                if deployment_name.lower() in model.id.lower() or model.id.lower() in deployment_name.lower():
                    # Some APIs return context_length or max_tokens
                    if hasattr(model, 'context_length') and model.context_length:
                        return model.context_length, "API (context_length)"
                    if hasattr(model, 'max_tokens') and model.max_tokens:
                        return model.max_tokens, "API (max_tokens)"
                    # Try to extract from model ID - sort by length to match specific names first
                    model_id = model.id.lower()
                    sorted_models = sorted(MODEL_CONTEXT_WINDOWS.items(), key=lambda x: len(x[0]), reverse=True)
                    for known_model, ctx_size in sorted_models:
                        if known_model in model_id:
                            return ctx_size, f"API model match ({model.id})"
        except Exception as e:
            print(f"   ⚠️ API query failed: {e}")
    
    # Method 2: Look up from known models table
    deployment_lower = deployment_name.lower()
    
    # Try exact match first
    if deployment_lower in MODEL_CONTEXT_WINDOWS:
        return MODEL_CONTEXT_WINDOWS[deployment_lower], "Known models table (exact)"
    
    # Try partial match - sort by model name length descending to match more specific names first
    # e.g., "gpt-4.1" should match before "gpt-4"
    sorted_models = sorted(MODEL_CONTEXT_WINDOWS.items(), key=lambda x: len(x[0]), reverse=True)
    for model_name, ctx_size in sorted_models:
        if model_name in deployment_lower or deployment_lower in model_name:
            return ctx_size, f"Known models table (partial: {model_name})"
    
    # Default fallback
    return 128000, "Default fallback (128K)"

def check_context_window():
    """
    Check the model's context window and demonstrate token counting.
    """
    print("=" * 70)
    print("📏 1.3: Understanding Context Window Limits")
    print("=" * 70)
    
    # Create Azure OpenAI client
    client = AzureOpenAI(
        azure_endpoint=AZURE_OPENAI_ENDPOINT,
        api_key=AZURE_OPENAI_API_KEY,
        api_version=AZURE_OPENAI_API_VERSION,
    )
    
    # Get model info
    print(f"\n📌 Deployment Name: {AZURE_OPENAI_CHAT_DEPLOYMENT_NAME}")
    
    # Query context window size
    context_window, source = get_model_context_window(AZURE_OPENAI_CHAT_DEPLOYMENT_NAME, client)
    print(f"📐 Context Window: {context_window:,} tokens")
    print(f"📡 Source: {source}")
    
    # Sample content to demonstrate token counting
    sample_texts = {
        "Simple prompt": "What is the capital of France?",
        "System instruction": "You are a helpful AI assistant that provides accurate and concise answers.",
        "1KB of text": "Lorem ipsum " * 100,
        "Sample JSON (100 records)": json.dumps([{"id": i, "name": f"item_{i}", "value": i * 10} for i in range(100)]),
    }
    
    print(f"\n📊 Token Estimates for Common Content Types:")
    print("-" * 50)
    
    for name, text in sample_texts.items():
        tokens = estimate_tokens(text)
        chars = len(text)
        ratio = chars / tokens if tokens > 0 else 0
        print(f"   {name}:")
        print(f"      Characters: {chars:,} | Tokens: {tokens:,} | Ratio: {ratio:.1f} chars/token")
    
    # Context window budget example (using queried value)
    print(f"\n📋 Context Window Budget ({context_window:,} tokens):")
    print("-" * 50)
    
    budget = {
        "Total capacity": context_window,
        "System prompt": 500,
        "Tool schemas (10 tools)": 3000,
        "Conversation history (5 turns)": 2500,
        "User query": 200,
        "Reserved for output": 4000,
    }
    
    used = sum(v for k, v in budget.items() if k not in ["Total capacity", "Reserved for output"])
    available = budget["Total capacity"] - used - budget["Reserved for output"]
    
    for name, tokens in budget.items():
        if name == "Total capacity":
            print(f"   {name}: {tokens:,} tokens")
        else:
            pct = (tokens / budget["Total capacity"]) * 100
            print(f"   - {name}: {tokens:,} tokens ({pct:.1f}%)")
    
    print(f"   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━")
    print(f"   Available for data/work: {available:,} tokens ({(available/budget['Total capacity'])*100:.1f}%)")
    
    print(f"\n💡 Tip: The examples in Part 2 will show you how to maximize this available space!")
    
    return budget, context_window

# Run the context window check
context_budget, model_context_window = check_context_window()

📏 1.3: Understanding Context Window Limits

📌 Deployment Name: gpt-4.1
📐 Context Window: 1,000,000 tokens
📡 Source: API model match (gpt-4.1-2025-04-14)

📊 Token Estimates for Common Content Types:
--------------------------------------------------
   Simple prompt:
      Characters: 30 | Tokens: 7 | Ratio: 4.3 chars/token
   System instruction:
      Characters: 74 | Tokens: 13 | Ratio: 5.7 chars/token
   1KB of text:
      Characters: 1,200 | Tokens: 201 | Ratio: 6.0 chars/token
   Sample JSON (100 records):
      Characters: 4,469 | Tokens: 2,001 | Ratio: 2.2 chars/token

📋 Context Window Budget (1,000,000 tokens):
--------------------------------------------------
   Total capacity: 1,000,000 tokens
   - System prompt: 500 tokens (0.1%)
   - Tool schemas (10 tools): 3,000 tokens (0.3%)
   - Conversation history (5 turns): 2,500 tokens (0.2%)
   - User query: 200 tokens (0.0%)
   - Reserved for output: 4,000 tokens (0.4%)
   ━━━━━━━━━━━━━

## Part 2: Context Optimization

Now that you understand MCP basics and context window limits, let's explore **optimization strategies** through **Anti-Pattern vs Best Practice** comparisons.

---

### Section 1: Single Code Execution

This section compares sending **full data through the context window** (anti-pattern) vs **processing data in a sandbox** (best practice).

### 1.1 [Anti-Pattern] Full Data in Context

In this example, we send the **full dataset** to Azure OpenAI for summarization. This demonstrates the token cost when all data must pass through the model's context window.

**Problem:** Every byte of data consumes tokens, even if we only need a summary!

In [9]:
# Example 1.1: [Anti-Pattern] Full Data Summarization with Azure OpenAI
# This demonstrates token consumption when full data is sent to the model

import tiktoken
from openai import AzureOpenAI

def estimate_tokens(text: str, model: str = "gpt-4.1") -> int:
    """Estimate token count for a given text using tiktoken."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

# Generate sample sales data
def generate_sample_sales_data(num_rows: int = 1000) -> list[dict]:
    """Generate sample sales data for demonstration."""
    import random
    
    products = ["Widget A", "Widget B", "Gadget X", "Gadget Y", "Tool Z"]
    quarters = ["Q1", "Q2", "Q3", "Q4"]
    regions = ["North", "South", "East", "West"]
    
    data = []
    for i in range(num_rows):
        data.append({
            "id": i + 1,
            "product": random.choice(products),
            "amount": round(random.uniform(100, 50000), 2),
            "quarter": random.choice(quarters),
            "region": random.choice(regions),
            "date": f"2025-{random.randint(1,12):02d}-{random.randint(1,28):02d}",
        })
    return data

# Generate sample data
sample_data = generate_sample_sales_data(1000)
full_data_json = json.dumps(sample_data, indent=2)

print("=" * 70)
print("📊 1.1 [Anti-Pattern] Full Data in Context")
print("=" * 70)
print(f"\n📈 Generated {len(sample_data)} sales records")
print(f"📝 Full data JSON size: {len(full_data_json):,} characters")

# Create Azure OpenAI client using Foundry endpoint
client = AzureOpenAI(
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_key=AZURE_OPENAI_API_KEY,
    api_version=AZURE_OPENAI_API_VERSION,
)

# Traditional approach: Send FULL data to the model for summarization
summarization_prompt = f"""Analyze the following sales data and provide a brief summary including:
1. Total number of records
2. Top 3 products by sales amount
3. Best performing quarter
4. Regional distribution

Sales Data:
{full_data_json}
"""

print(f"\n🔄 Sending full data to Azure OpenAI for summarization...")
print(f"   Prompt tokens (estimated): {estimate_tokens(summarization_prompt):,}")

response = client.chat.completions.create(
    model=AZURE_OPENAI_CHAT_DEPLOYMENT_NAME,
    messages=[
        {"role": "system", "content": "You are a data analyst. Provide concise summaries."},
        {"role": "user", "content": summarization_prompt}
    ],
    max_tokens=500,
)

# Extract token usage from response
usage = response.usage
total_tokens_full_context = usage.total_tokens

print(f"\n✅ Response received!")
print(f"\n📋 Summary:")
print("-" * 50)
print(response.choices[0].message.content)
print("-" * 50)

print(f"\n🎟️  Token Usage (Full Context Approach):")
print(f"   Prompt tokens:     {usage.prompt_tokens:,}")
print(f"   Completion tokens: {usage.completion_tokens:,}")
print(f"   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━")
print(f"   TOTAL tokens:      {usage.total_tokens:,}")

📊 1.1 [Anti-Pattern] Full Data in Context

📈 Generated 1000 sales records
📝 Full data JSON size: 142,651 characters

🔄 Sending full data to Azure OpenAI for summarization...
   Prompt tokens (estimated): 56,427

✅ Response received!

📋 Summary:
--------------------------------------------------
**Sales Data Summary:**

**1. Total Number of Records**
- **1,000 records**

---

**2. Top 3 Products by Sales Amount**

Aggregate sales amounts by product:
- Widget A: **$13,169,556.14**
- Gadget Y: **$12,474,453.15**
- Widget B: **$12,308,032.79**

**Top 3:**
1. **Widget A**
2. **Gadget Y**
3. **Widget B**

---

**3. Best Performing Quarter**

Total sales by quarter:
- Q1: **$13,253,704.27**
- Q2: **$12,183,256.89**
- Q3: **$13,951,029.40**
- Q4: **$12,083,484.69**

**Highest:**  
**Q3 ($13,951,029.40)**

---

**4. Regional Distribution**

Total sales by region:
- **North:** $8,091,539.10
- **South:** $9,718,494.44
- **East:** $8,572,454.34
- **West:** $8,089,987.38

**Largest

### 1.2 [Best Practice] Data Generation in Sandbox

Instead of sending full data through the context, we use **Code Execution** to process data in a sandboxed environment. Only the **summary** is returned to the model.

**Key Benefits:**

| Benefit | Description |
|---------|-----------|
| **Token Efficiency** | Raw data never enters context |
| **Scalability** | Works with any data size |
| **Privacy** | Sensitive data stays in sandbox |
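
To make the idea concrete, here is a sketch of the kind of code that could run inside the sandbox. It uses only the standard library (no pandas), a fixed seed chosen for illustration, and a hypothetical `report` structure; the actual code the agent writes will differ from run to run.

```python
import random
from collections import defaultdict

random.seed(0)  # fixed seed so repeated runs produce the same data
products = ["Widget A", "Widget B", "Gadget X", "Gadget Y", "Tool Z"]
quarters = ["Q1", "Q2", "Q3", "Q4"]

# The raw records exist only inside this process.
records = [
    {
        "product": random.choice(products),
        "quarter": random.choice(quarters),
        "amount": round(random.uniform(100, 50_000), 2),
    }
    for _ in range(1_000)
]

# Aggregate sales by product and by quarter.
by_product: dict[str, float] = defaultdict(float)
by_quarter: dict[str, float] = defaultdict(float)
for r in records:
    by_product[r["product"]] += r["amount"]
    by_quarter[r["quarter"]] += r["amount"]

report = {
    "total_records": len(records),
    "top_products": sorted(by_product, key=by_product.get, reverse=True)[:3],
    "best_quarter": max(by_quarter, key=by_quarter.get),
}
print(report)  # only this small dict would be returned to the model
```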

In [10]:
# Example 1.2: Code Execution Approach - Data Generation with Remote Code Execution
# This demonstrates token efficiency when data is generated and processed in a sandbox

from agent_framework.azure import AzureAIClient
from azure.ai.projects.models import CodeInterpreterTool

async def code_execution_example():
    """
    Code Execution approach: Generate and analyze data in a sandbox.
    
    This demonstrates:
    - Remote code execution via CodeInterpreterTool
    - Data generation inside sandbox (never enters context)
    - Only summary returned to model
    """
    print("=" * 70)
    print("🚀 1.2 [Best Practice] Data Generation in Sandbox")
    print("=" * 70)
    
    # User query that triggers code execution
    code_execution_query = """
    Generate 1000 sales records with the following structure:
    - id, product (Widget A/B, Gadget X/Y, Tool Z), amount (100-50000), quarter (Q1-Q4), region (North/South/East/West), date

    Then analyze the data and return ONLY a summary with:
    1. Total records and total sales amount
    2. Top 3 products by sales
    3. Best quarter by revenue
    4. Regional breakdown

    Use Python with pandas. Return the summary as a formatted report.
    """
    
    print(f"\n📝 Sending code execution request...")
    print(f"   Prompt tokens (estimated): {estimate_tokens(code_execution_query):,}")
    print(f"   Note: Raw data stays in sandbox - never enters model context!")
    
    async with (
        # Use AzureAIClient pattern with CodeInterpreterTool
        AzureAIClient(credential=credential, project_endpoint=AZURE_AI_PROJECT_ENDPOINT).create_agent(
            name="DataAnalystAgent",
            instructions="You are a data analyst. Generate data using Python code and return only summarized results.",
            tools=CodeInterpreterTool(),
        ) as agent,
    ):
        result = await agent.run(code_execution_query)
        
        print(f"\n✅ Code executed in sandbox!")
        print(f"\n📋 Summary (from sandbox execution):")
        print("-" * 50)
        # Get the response text
        response_text = str(result)
        print(response_text[:1500] + "..." if len(response_text) > 1500 else response_text)
        print("-" * 50)
        
        # Estimate token usage for the code execution approach.
        # Only the prompt and the final response text are counted; the 1000 records stay
        # in the sandbox. Tool-call overhead is not included, so treat this as a rough figure.
        summary_tokens = estimate_tokens(response_text)
        prompt_tokens_code_exec = estimate_tokens(code_execution_query)
        total_tokens_code_exec = prompt_tokens_code_exec + summary_tokens
        
        print(f"\n🎟️  Token Usage (Code Execution Approach, estimated):")
        print(f"   Prompt tokens:     {prompt_tokens_code_exec:,}")
        print(f"   Response tokens:   {summary_tokens:,}")
        print(f"   Data in sandbox:   ~{estimate_tokens(full_data_json):,} (NOT counted!)")
        print(f"   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━")
        print(f"   TOTAL tokens:      {total_tokens_code_exec:,}")
        
        # Compare with traditional approach
        print(f"\n" + "=" * 70)
        print("📊 Comparison: Full Context vs Code Execution")
        print("=" * 70)
        print(f"   Full Context (1.1):    {total_tokens_full_context:,} tokens")
        print(f"   Code Execution (1.2): {total_tokens_code_exec:,} tokens")
        print(f"   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━")
        savings = ((total_tokens_full_context - total_tokens_code_exec) / total_tokens_full_context) * 100
        print(f"   💰 Token Savings:     {savings:.1f}%")
        
        return result

# Run the example
await code_execution_example()

🚀 1.2 [Best Practice] Data Generation in Sandbox

📝 Sending code execution request...
   Prompt tokens (estimated): 122
   Note: Raw data stays in sandbox - never enters model context!

✅ Code executed in sandbox!

📋 Summary (from sandbox execution):
--------------------------------------------------
import pandas as pd, numpy as np
np.random.seed(42)
n=1000
products=["Widget A","Widget B","Gadget X","Gadget Y","Tool Z"]
quarters=["Q1","Q2","Q3","Q4"]
regions=["North","South","East","West"]
ids=np.arange(1,n+1)
prod=np.random.choice(products,n)
amount=np.random.randint(100,50001,n)
q=np.random.choice(quarters,n)
reg=np.random.choice(regions,n)
# generate dates within 2025, aligned with quarter
# map quarter to month ranges
q_months={"Q1":(1,3),"Q2":(4,6),"Q3":(7,9),"Q4":(10,12)}
dates=[]
for qq in q:
    m1,m2=q_months[qq]
    month=np.random.randint(m1,m2+1)
    day=np.random.randint(1,29)
    dates.append(pd.Timestamp(year=2025,month=month,day=day))
df=pd.DataFrame({"id":i

<agent_framework._types.AgentRunResponse at 0x78391f05fc50>

### Section 2: Multi-Step Code Execution

This section compares sending **full data through complex workflows** (anti-pattern) vs **multi-step pipeline in single execution** (best practice).

### 2.1 [Anti-Pattern] Complex Workflows (4 Round Trips)

In this example, we execute a **4-step data analysis workflow** using the traditional approach. Each step requires sending the **full dataset** to Azure OpenAI, demonstrating the token cost of multi-step workflows.

| Step | Description | Data Sent |
|------|-------------|-----------|
| 1 | Analyze data structure | Full dataset |
| 2 | Categorize amounts | Full dataset |
| 3 | Aggregate by dimensions | Full dataset |
| 4 | Generate final report | Full dataset |

**Problem**: The same dataset is sent 4 times, consuming tokens each time!
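
The cost of this pattern can be sketched with simple arithmetic. The helper below is illustrative only: `repeated_context_cost` and the token figures passed to it are hypothetical, not values measured in this notebook.

```python
def repeated_context_cost(data_tokens: int, instruction_tokens: int,
                          response_tokens: int, steps: int) -> int:
    """Total tokens when every step resends the full dataset."""
    per_step = data_tokens + instruction_tokens + response_tokens
    return per_step * steps


# Hypothetical figures: a 50,000-token dataset, short prompts, short answers.
one_step = repeated_context_cost(50_000, 100, 300, steps=1)
four_steps = repeated_context_cost(50_000, 100, 300, steps=4)

print(f"1 step:  {one_step:,} tokens")
print(f"4 steps: {four_steps:,} tokens")
print(f"Share spent on resent data: {50_000 * 4 / four_steps:.1%}")
```

Under these assumptions nearly all of the spend is the dataset itself, which is why moving the data out of the prompt matters more than shortening the instructions.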

In [11]:
# Example 2.1: Complex Workflows with Full Data (4 Round Trips)
# This demonstrates the token cost when executing multi-step workflows traditionally

from openai import AzureOpenAI

def traditional_multi_step_workflow():
    """
    Traditional approach: Execute 4 separate API calls, each with full data.
    
    This simulates a typical multi-step workflow:
    1. Analyze data structure
    2. Categorize amounts
    3. Aggregate by dimensions
    4. Generate final report
    
    Each step resends the full dataset to the model.
    """
    print("=" * 70)
    print("📊 2.1 [Anti-Pattern] Complex Workflows (4 Round Trips)")
    print("=" * 70)
    
    # Create Azure OpenAI client
    client = AzureOpenAI(
        azure_endpoint=AZURE_OPENAI_ENDPOINT,
        api_key=AZURE_OPENAI_API_KEY,
        api_version=AZURE_OPENAI_API_VERSION,
    )
    
    total_tokens_all_steps = 0
    
    # ========================================
    # Step 1: Analyze raw data structure
    # ========================================
    print("\n🔄 Step 1: Analyzing data structure...")
    step1_prompt = f"""Analyze the following sales data and describe its structure:
- Number of records
- Fields available
- Data types

Sales Data:
{full_data_json}
"""
    
    response1 = client.chat.completions.create(
        model=AZURE_OPENAI_CHAT_DEPLOYMENT_NAME,
        messages=[
            {"role": "system", "content": "You are a data analyst."},
            {"role": "user", "content": step1_prompt}
        ],
        max_tokens=300,
    )
    step1_tokens = response1.usage.total_tokens
    total_tokens_all_steps += step1_tokens
    print(f"   ✅ Step 1 completed: {step1_tokens:,} tokens")
    
    # ========================================
    # Step 2: Transform and categorize data
    # ========================================
    print("\n🔄 Step 2: Categorizing amounts...")
    step2_prompt = f"""Analyze the sales data and categorize the amounts:
- Low: < 1,000
- Medium: 1,000 - 5,000
- High: 5,000 - 20,000
- Premium: > 20,000

Count how many records fall into each category.

Sales Data:
{full_data_json}
"""
    
    response2 = client.chat.completions.create(
        model=AZURE_OPENAI_CHAT_DEPLOYMENT_NAME,
        messages=[
            {"role": "system", "content": "You are a data analyst."},
            {"role": "user", "content": step2_prompt}
        ],
        max_tokens=300,
    )
    step2_tokens = response2.usage.total_tokens
    total_tokens_all_steps += step2_tokens
    print(f"   ✅ Step 2 completed: {step2_tokens:,} tokens")
    
    # ========================================
    # Step 3: Aggregate by dimensions
    # ========================================
    print("\n🔄 Step 3: Aggregating by quarter and region...")
    step3_prompt = f"""Analyze the sales data and provide:
1. Total sales by quarter (Q1, Q2, Q3, Q4)
2. Total sales by region (North, South, East, West)
3. Top 3 products by total sales

Sales Data:
{full_data_json}
"""
    
    response3 = client.chat.completions.create(
        model=AZURE_OPENAI_CHAT_DEPLOYMENT_NAME,
        messages=[
            {"role": "system", "content": "You are a data analyst."},
            {"role": "user", "content": step3_prompt}
        ],
        max_tokens=400,
    )
    step3_tokens = response3.usage.total_tokens
    total_tokens_all_steps += step3_tokens
    print(f"   ✅ Step 3 completed: {step3_tokens:,} tokens")
    
    # ========================================
    # Step 4: Generate final report
    # ========================================
    print("\n🔄 Step 4: Generating final report...")
    step4_prompt = f"""Based on the sales data, generate a comprehensive executive summary including:
1. Overall performance metrics (total revenue, average sale, max/min)
2. Best performing quarter
3. Regional insights
4. Product recommendations

Sales Data:
{full_data_json}
"""
    
    response4 = client.chat.completions.create(
        model=AZURE_OPENAI_CHAT_DEPLOYMENT_NAME,
        messages=[
            {"role": "system", "content": "You are a business analyst."},
            {"role": "user", "content": step4_prompt}
        ],
        max_tokens=500,
    )
    step4_tokens = response4.usage.total_tokens
    total_tokens_all_steps += step4_tokens
    print(f"   ✅ Step 4 completed: {step4_tokens:,} tokens")
    
    # ========================================
    # Summary
    # ========================================
    print("\n" + "=" * 70)
    print("📋 Final Report (from Step 4):")
    print("-" * 50)
    print(response4.choices[0].message.content)
    print("-" * 50)
    
    print(f"\n🎟️  Token Usage (Traditional 4-Step Workflow):")
    print(f"   Step 1 (Structure):    {step1_tokens:,} tokens")
    print(f"   Step 2 (Categorize):   {step2_tokens:,} tokens")
    print(f"   Step 3 (Aggregate):    {step3_tokens:,} tokens")
    print(f"   Step 4 (Report):       {step4_tokens:,} tokens")
    print(f"   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━")
    print(f"   TOTAL tokens:          {total_tokens_all_steps:,} tokens")
    
    print(f"\n⚠️  Note: Full dataset ({len(sample_data)} records) was sent 4 times!")
    print(f"   Data tokens per trip:  ~{estimate_tokens(full_data_json):,}")
    print(f"   Data tokens (4 trips): ~{estimate_tokens(full_data_json) * 4:,}")
    
    return total_tokens_all_steps

# Run the complex workflow
total_tokens_4trips = traditional_multi_step_workflow()

📊 2.1 [Anti-Pattern] Complex Workflows (4 Round Trips)

🔄 Step 1: Analyzing data structure...
   ✅ Step 1 completed: 56,726 tokens

🔄 Step 2: Categorizing amounts...
   ✅ Step 2 completed: 56,765 tokens

🔄 Step 3: Aggregating by quarter and region...
   ✅ Step 3 completed: 56,854 tokens

🔄 Step 4: Generating final report...
   ✅ Step 4 completed: 56,947 tokens

📋 Final Report (from Step 4):
--------------------------------------------------
Here is an executive summary based on the analysis of the provided sales data (1,000 transactions across products, regions, and quarters for 2025):

---

### Executive Summary

#### 1. Overall Performance Metrics

- **Total Revenue:**  
  $31,513,603.08

- **Average Sale (per transaction):**  
  $31,513.60

- **Maximum Single Sale (Transaction):**  
  $49,996.19 (Gadget X, North, Q1)

- **Minimum Single Sale (Transaction):**  
  $102.41 (Tool Z, West, Q2)

---

#### 2. Best Performing Quarter

| Quarter | Total Revenue    |

### 2.2 [Best Practice] Multi-Step Pipeline in Single Execution

With code execution, agents can use familiar programming constructs (loops, conditionals, error handling) to execute **complex workflows in a single step** rather than multiple round trips.
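
As a rough sketch of what such a single-execution pipeline might look like inside the sandbox, the following uses only the Python standard library (the notebook's prompt asks for pandas). `run_pipeline`, `categorize`, and the half-open category boundaries are assumptions made for illustration, and date handling is omitted for brevity. All 1,000 records stay local; only the small summary dictionary would be returned to the model.

```python
import random
from collections import Counter, defaultdict


def categorize(amount: int) -> str:
    """Bucket an amount using half-open ranges so no value is ambiguous."""
    if amount < 1_000:
        return "Low"
    if amount < 5_000:
        return "Medium"
    if amount < 20_000:
        return "High"
    return "Premium"


def run_pipeline(n: int = 1000, seed: int = 42) -> dict:
    rng = random.Random(seed)
    products = ["Widget A", "Widget B", "Gadget X", "Gadget Y", "Tool Z"]
    regions = ["North", "South", "East", "West"]

    # Step 1: generate records (they never leave this function)
    records = [
        {
            "id": i,
            "product": rng.choice(products),
            "amount": rng.randint(100, 50_000),
            "quarter": f"Q{rng.randint(1, 4)}",
            "region": rng.choice(regions),
        }
        for i in range(1, n + 1)
    ]

    # Steps 2-3: transform and aggregate in one pass
    by_quarter, by_region = defaultdict(int), defaultdict(int)
    categories = Counter()
    for rec in records:
        categories[categorize(rec["amount"])] += 1
        by_quarter[rec["quarter"]] += rec["amount"]
        by_region[rec["region"]] += rec["amount"]

    # Step 4: return only the compact summary
    total = sum(rec["amount"] for rec in records)
    return {
        "records": n,
        "total_revenue": total,
        "average_sale": round(total / n, 2),
        "by_quarter": dict(sorted(by_quarter.items())),
        "by_region": dict(by_region),
        "categories": dict(categories),
    }


summary = run_pipeline()
print(summary)
```

The summary stays a few hundred characters long regardless of how many records are generated, which is the property the single-execution approach relies on.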

In [12]:
# Example 2.2: Multi-Step Pipeline in Single Code Execution
# This demonstrates executing complex multi-step workflows in a single agent call

from agent_framework.azure import AzureAIClient
from azure.ai.projects.models import CodeInterpreterTool

async def multi_step_pipeline_example():
    """
    Multi-step data pipeline executed in a single agent call.
    
    Traditional approach would require 4+ round trips:
    1. Get data -> tokens consumed
    2. Transform data -> tokens consumed
    3. Aggregate data -> tokens consumed
    4. Format output -> tokens consumed
    
    Code execution: All steps run in sandbox, only summary returned.
    """
    print("=" * 70)
    print("🔄 2.2 [Best Practice] Multi-Step Pipeline in Single Execution")
    print("=" * 70)
    
    # Complex multi-step pipeline request
    pipeline_query = """
    Execute a complete data analysis pipeline with these steps:
    
    **Step 1: Generate Data**
    Create 1000 sales records with: id, product (Widget A/B, Gadget X/Y, Tool Z), 
    amount (100-50000), quarter (Q1-Q4), region (North/South/East/West), date
    
    **Step 2: Transform Data**
    - Categorize amounts: Low (<1000), Medium (1000-5000), High (5000-20000), Premium (>20000)
    - Parse dates and extract month
    
    **Step 3: Aggregate by Multiple Dimensions**
    - Quarterly totals and averages
    - Regional breakdown
    - Category distribution
    - Top products
    
    **Step 4: Generate Report**
    Return a formatted summary report with:
    - Total records processed
    - Quarterly performance (Q1-Q4 totals)
    - Regional performance
    - Overall metrics (total revenue, avg sale, max/min)
    - Category distribution
    
    Use Python with pandas. Execute all steps and return ONLY the final summary.
    """
    
    print(f"\n📝 Sending multi-step pipeline request...")
    print(f"   Prompt tokens (estimated): {estimate_tokens(pipeline_query):,}")
    print(f"   Note: All 4 steps execute in sandbox - only final report returned!")
    
    async with (
        AzureAIClient(credential=credential, project_endpoint=AZURE_AI_PROJECT_ENDPOINT).create_agent(
            name="PipelineAgent",
            instructions="You are a data engineer. Execute multi-step data pipelines and return only summarized results.",
            tools=CodeInterpreterTool(),
        ) as agent,
    ):
        result = await agent.run(pipeline_query)
        
        print(f"\n✅ Pipeline executed in sandbox!")
        print(f"\n📋 Pipeline Result:")
        print("-" * 50)
        response_text = str(result)
        print(response_text[:2000] + "..." if len(response_text) > 2000 else response_text)
        print("-" * 50)
        
        # Token analysis
        response_tokens = estimate_tokens(response_text)
        prompt_tokens = estimate_tokens(pipeline_query)
        total_tokens_pipeline = prompt_tokens + response_tokens
        
        print(f"\n🎟️  Token Usage (Multi-Step Pipeline):")
        print(f"   Prompt tokens:          {prompt_tokens:,}")
        print(f"   Response tokens:        {response_tokens:,}")
        print(f"   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━")
        print(f"   TOTAL tokens:           {total_tokens_pipeline:,}")
        
        # Compare with actual traditional approach from Example 2.1
        print(f"\n📊 Comparison: Traditional 4 Round Trips (2.1) vs Single Pipeline (2.2)")
        print(f"   Traditional (2.1):       {total_tokens_4trips:,} tokens")
        print(f"   Code Execution (2.2):    {total_tokens_pipeline:,} tokens")
        print(f"   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━")
        savings = ((total_tokens_4trips - total_tokens_pipeline) / total_tokens_4trips) * 100
        print(f"   💰 Token Savings:        {savings:.1f}%")
        
        return result

# Run the example
await multi_step_pipeline_example()

🔄 2.2 [Best Practice] Multi-Step Pipeline in Single Execution

📝 Sending multi-step pipeline request...
   Prompt tokens (estimated): 232
   Note: All 4 steps execute in sandbox - only final report returned!

✅ Pipeline executed in sandbox!

📋 Pipeline Result:
--------------------------------------------------
import pandas as pd, numpy as np
np.random.seed(42)
n=1000
products=["Widget A","Widget B","Gadget X","Gadget Y","Tool Z"]
quarters=["Q1","Q2","Q3","Q4"]
regions=["North","South","East","West"]
# generate dates within 2025 maybe
start=pd.Timestamp("2025-01-01")
end=pd.Timestamp("2025-12-31")
dates=pd.to_datetime(np.random.randint(start.value//10**9, end.value//10**9, n), unit='s')
df=pd.DataFrame({
    "id": np.arange(1,n+1),
    "product": np.random.choice(products,n, p=[0.22,0.18,0.2,0.2,0.2]),
    "amount": np.random.randint(100,50001,n),
    "quarter": np.random.choice(quarters,n),
    "region": np.random.choice(regions,n),
    "date": dates
})
# transform
bins=[-n

<agent_framework._types.AgentRunResponse at 0x78391f0aa780>

---

### Section 3: MCP Optimization

This section compares **loading all MCP tools** (anti-pattern) vs **filtering tools by intent** (best practice).

### 3.1 [Anti-Pattern] Remote MCP Schema Bloat

When connecting to **large MCP servers** (like Azure MCP with 50+ tools), the tool schema definitions alone consume significant context tokens. Every conversation turn includes ALL tool definitions, even if only one tool is used.

| MCP Server | Estimated Tools | Schema Tokens |
|------------|-----------------|---------------|
| Calculator | 4-5 tools | ~500 tokens |
| Filesystem | 10-15 tools | ~2,000 tokens |
| GitHub | 20-30 tools | ~5,000 tokens |
| **Azure MCP** | **50+ tools** | **~15,000+ tokens** |

**Problem**: Tool schemas are included in EVERY API call, multiplying token cost with conversation length!
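
A simple model makes the multiplication concrete. In the sketch below, `cumulative_prompt_tokens` is a hypothetical helper and every number is illustrative: each turn pays for the full schema again, plus whatever history has accumulated so far.

```python
def cumulative_prompt_tokens(schema_tokens: int, base_tokens: int,
                             tokens_added_per_turn: int, turns: int) -> int:
    """Sum prompt tokens over a conversation where schemas ride along every turn."""
    total = 0
    for turn in range(turns):
        history_tokens = base_tokens + turn * tokens_added_per_turn
        total += schema_tokens + history_tokens
    return total


# Illustrative values: 15,000-token schema, 50-token system prompt,
# 100 tokens of new history per turn.
for turns in (1, 3, 10):
    total = cumulative_prompt_tokens(15_000, 50, 100, turns)
    schema_share = 15_000 * turns / total
    print(f"{turns:>2} turns: {total:,} prompt tokens ({schema_share:.0%} schema)")
```

With a schema this large relative to the messages, the schema term dominates at every conversation length, so trimming history alone would barely move the total.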

In [13]:
# Example 3.1 [Anti-Pattern] Tool Schema Bloat with Large MCP Servers
# This demonstrates how large tool schemas consume context even before any work is done

from openai import AzureOpenAI

def simulate_large_mcp_schema():
    """
    Simulates the tool schema from a large MCP server like Azure MCP.
    In reality, Azure MCP has 50+ tools for various Azure services.
    """
    # Simulated Azure MCP tool definitions (simplified)
    azure_mcp_tools = [
        {
            "type": "function",
            "function": {
                "name": "azure_storage_list_containers",
                "description": "List all containers in an Azure Storage account. Returns container names, metadata, and properties.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "storage_account": {"type": "string", "description": "Name of the Azure Storage account"},
                        "resource_group": {"type": "string", "description": "Resource group containing the storage account"},
                        "subscription_id": {"type": "string", "description": "Azure subscription ID"},
                        "include_metadata": {"type": "boolean", "description": "Whether to include container metadata"},
                        "include_deleted": {"type": "boolean", "description": "Whether to include soft-deleted containers"},
                    },
                    "required": ["storage_account", "resource_group"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "azure_storage_upload_blob",
                "description": "Upload a blob to an Azure Storage container. Supports block blobs, append blobs, and page blobs.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "storage_account": {"type": "string", "description": "Name of the Azure Storage account"},
                        "container": {"type": "string", "description": "Name of the container"},
                        "blob_name": {"type": "string", "description": "Name of the blob to create"},
                        "content": {"type": "string", "description": "Content to upload"},
                        "blob_type": {"type": "string", "enum": ["BlockBlob", "AppendBlob", "PageBlob"]},
                        "content_type": {"type": "string", "description": "MIME type of the content"},
                        "metadata": {"type": "object", "description": "Custom metadata for the blob"},
                    },
                    "required": ["storage_account", "container", "blob_name", "content"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "azure_vm_list",
                "description": "List all virtual machines in a subscription or resource group.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "subscription_id": {"type": "string", "description": "Azure subscription ID"},
                        "resource_group": {"type": "string", "description": "Optional resource group filter"},
                        "status_filter": {"type": "string", "enum": ["running", "stopped", "deallocated", "all"]},
                    },
                    "required": ["subscription_id"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "azure_vm_start",
                "description": "Start an Azure virtual machine.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "vm_name": {"type": "string", "description": "Name of the virtual machine"},
                        "resource_group": {"type": "string", "description": "Resource group containing the VM"},
                        "subscription_id": {"type": "string", "description": "Azure subscription ID"},
                    },
                    "required": ["vm_name", "resource_group"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "azure_vm_stop",
                "description": "Stop an Azure virtual machine. Can optionally deallocate to save costs.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "vm_name": {"type": "string", "description": "Name of the virtual machine"},
                        "resource_group": {"type": "string", "description": "Resource group containing the VM"},
                        "subscription_id": {"type": "string", "description": "Azure subscription ID"},
                        "deallocate": {"type": "boolean", "description": "Whether to deallocate (true) or just stop (false)"},
                    },
                    "required": ["vm_name", "resource_group"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "azure_keyvault_get_secret",
                "description": "Retrieve a secret from Azure Key Vault.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "vault_name": {"type": "string", "description": "Name of the Key Vault"},
                        "secret_name": {"type": "string", "description": "Name of the secret"},
                        "version": {"type": "string", "description": "Optional secret version"},
                    },
                    "required": ["vault_name", "secret_name"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "azure_keyvault_set_secret",
                "description": "Create or update a secret in Azure Key Vault.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "vault_name": {"type": "string", "description": "Name of the Key Vault"},
                        "secret_name": {"type": "string", "description": "Name of the secret"},
                        "value": {"type": "string", "description": "Value of the secret"},
                        "content_type": {"type": "string", "description": "Content type of the secret"},
                        "expires_on": {"type": "string", "description": "Expiration date in ISO format"},
                    },
                    "required": ["vault_name", "secret_name", "value"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "azure_cosmos_query",
                "description": "Execute a SQL query against an Azure Cosmos DB container.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "account": {"type": "string", "description": "Cosmos DB account name"},
                        "database": {"type": "string", "description": "Database name"},
                        "container": {"type": "string", "description": "Container name"},
                        "query": {"type": "string", "description": "SQL query to execute"},
                        "parameters": {"type": "array", "items": {"type": "object"}, "description": "Query parameters"},
                        "max_items": {"type": "integer", "description": "Maximum items to return"},
                    },
                    "required": ["account", "database", "container", "query"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "azure_sql_execute",
                "description": "Execute a SQL query against Azure SQL Database.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "server": {"type": "string", "description": "SQL Server name"},
                        "database": {"type": "string", "description": "Database name"},
                        "query": {"type": "string", "description": "SQL query to execute"},
                        "parameters": {"type": "object", "description": "Query parameters"},
                    },
                    "required": ["server", "database", "query"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "azure_function_invoke",
                "description": "Invoke an Azure Function via HTTP trigger.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "function_app": {"type": "string", "description": "Function App name"},
                        "function_name": {"type": "string", "description": "Function name"},
                        "method": {"type": "string", "enum": ["GET", "POST", "PUT", "DELETE"]},
                        "body": {"type": "object", "description": "Request body"},
                        "headers": {"type": "object", "description": "Request headers"},
                    },
                    "required": ["function_app", "function_name"]
                }
            }
        },
    ]
    
    # Add more tools to simulate a realistic large MCP server (50+ tools)
    additional_services = [
        "azure_monitor_query_logs", "azure_monitor_get_metrics", "azure_monitor_create_alert",
        "azure_aks_list_clusters", "azure_aks_get_credentials", "azure_aks_scale_nodepool",
        "azure_acr_list_repos", "azure_acr_push_image", "azure_acr_delete_image",
        "azure_servicebus_send_message", "azure_servicebus_receive_messages", "azure_servicebus_create_queue",
        "azure_eventhub_send_event", "azure_eventhub_receive_events", "azure_eventhub_create_hub",
        "azure_redis_get", "azure_redis_set", "azure_redis_delete", "azure_redis_list_keys",
        "azure_apim_list_apis", "azure_apim_create_api", "azure_apim_get_subscription_key",
        "azure_logic_app_trigger", "azure_logic_app_list_runs", "azure_logic_app_cancel_run",
        "azure_cdn_purge", "azure_cdn_list_endpoints", "azure_cdn_create_profile",
        "azure_frontdoor_list_routes", "azure_frontdoor_create_origin", "azure_frontdoor_update_policy",
        "azure_dns_list_zones", "azure_dns_create_record", "azure_dns_delete_record",
        "azure_network_list_vnets", "azure_network_create_subnet", "azure_network_get_nsg_rules",
        "azure_appservice_list_apps", "azure_appservice_deploy", "azure_appservice_restart",
        "azure_batch_create_job", "azure_batch_list_tasks", "azure_batch_get_output",
    ]
    
    for service in additional_services:
        azure_mcp_tools.append({
            "type": "function",
            "function": {
                "name": service,
                "description": f"Execute {service.replace('_', ' ')} operation in Azure.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "resource_id": {"type": "string", "description": "Azure resource ID"},
                        "options": {"type": "object", "description": "Operation-specific options"},
                    },
                    "required": ["resource_id"]
                }
            }
        })
    
    return azure_mcp_tools


def traditional_tool_schema_bloat_example():
    """
    Traditional approach: All tool schemas included in every API call.
    
    This demonstrates the problem of tool schema bloat where:
    - 50+ tool definitions consume ~15,000+ tokens
    - Every conversation turn repeats all tool definitions
    - Multi-turn conversations multiply the token cost
    """
    print("=" * 70)
    print("📊 3.1 [Anti-Pattern] Remote MCP Schema Bloat")
    print("=" * 70)
    
    # Get simulated large MCP tools
    azure_mcp_tools = simulate_large_mcp_schema()
    
    tools_json = json.dumps(azure_mcp_tools, indent=2)
    tools_token_count = estimate_tokens(tools_json)
    
    print(f"\n🔧 Simulated Azure MCP Server:")
    print(f"   Total tools: {len(azure_mcp_tools)}")
    print(f"   Schema tokens: {tools_token_count:,}")
    
    # Create Azure OpenAI client
    client = AzureOpenAI(
        azure_endpoint=AZURE_OPENAI_ENDPOINT,
        api_key=AZURE_OPENAI_API_KEY,
        api_version=AZURE_OPENAI_API_VERSION,
    )
    
    # Simulate a 3-turn conversation with tool schema bloat
    conversation_history = [
        {"role": "system", "content": "You are an Azure infrastructure assistant. Help users manage their Azure resources."}
    ]
    
    user_queries = [
        "List all my storage containers in the 'production-rg' resource group.",
        "Now show me all VMs in the same resource group.",
        "Finally, get the connection string secret from my 'prod-keyvault'.",
    ]
    
    total_tokens_all_turns = 0
    turn_details = []
    
    for turn_num, query in enumerate(user_queries, 1):
        print(f"\n🔄 Turn {turn_num}: {query[:50]}...")
        
        # Add user message
        conversation_history.append({"role": "user", "content": query})
        
        # Make API call with ALL tools (traditional approach)
        response = client.chat.completions.create(
            model=AZURE_OPENAI_CHAT_DEPLOYMENT_NAME,
            messages=conversation_history,
            tools=azure_mcp_tools,
            tool_choice="auto",
            max_tokens=300,
        )
        
        usage = response.usage
        total_tokens_all_turns += usage.total_tokens
        
        # Get assistant response
        assistant_message = response.choices[0].message
        
        # Add assistant response to history (simulated tool response)
        if assistant_message.tool_calls:
            tool_name = assistant_message.tool_calls[0].function.name
            conversation_history.append({"role": "assistant", "content": None, "tool_calls": [
                {"id": "call_1", "type": "function", "function": {"name": tool_name, "arguments": "{}"}}
            ]})
            # Simulate tool response
            conversation_history.append({
                "role": "tool",
                "tool_call_id": "call_1", 
                "content": f"[Simulated response from {tool_name}]"
            })
        else:
            conversation_history.append({"role": "assistant", "content": assistant_message.content or ""})
        
        turn_details.append({
            "turn": turn_num,
            "prompt_tokens": usage.prompt_tokens,
            "completion_tokens": usage.completion_tokens,
            "total_tokens": usage.total_tokens,
        })
        
        print(f"   ✅ Tokens used: {usage.total_tokens:,} (prompt: {usage.prompt_tokens:,})")
    
    # Summary
    print("\n" + "=" * 70)
    print("🎟️  Token Usage Summary (Traditional - All Tools Every Turn)")
    print("=" * 70)
    
    for detail in turn_details:
        print(f"   Turn {detail['turn']}: {detail['total_tokens']:,} tokens (prompt: {detail['prompt_tokens']:,})")
    
    print(f"   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━")
    print(f"   TOTAL (3 turns): {total_tokens_all_turns:,} tokens")
    
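    # Note: tools_token_count is an estimate over pretty-printed JSON (indent=2),
    # so it can exceed the prompt tokens the API actually reports.
    print(f"\n⚠️  Problem Analysis:")
    print(f"   Tool schema tokens:     {tools_token_count:,} (per turn)")
    print(f"   Schema × 3 turns:       ~{tools_token_count * 3:,} tokens")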
    print(f"   Actual total:           {total_tokens_all_turns:,} tokens")
    print(f"   Schema overhead:        ~{(tools_token_count * 3 / total_tokens_all_turns * 100):.1f}%")
    
    return total_tokens_all_turns, len(azure_mcp_tools), tools_token_count

# Run the example
total_tokens_schema_bloat, num_tools, schema_tokens = traditional_tool_schema_bloat_example()

📊 3.1 [Anti-Pattern] Remote MCP Schema Bloat

🔧 Simulated Azure MCP Server:
   Total tools: 53
   Schema tokens: 7,309

🔄 Turn 1: List all my storage containers in the 'production-...
   ✅ Tokens used: 2,179 (prompt: 2,113)

🔄 Turn 2: Now show me all VMs in the same resource group....
   ✅ Tokens used: 2,244 (prompt: 2,197)

🔄 Turn 3: Finally, get the connection string secret from my ...
   ✅ Tokens used: 2,324 (prompt: 2,264)

🎟️  Token Usage Summary (Traditional - All Tools Every Turn)
   Turn 1: 2,179 tokens (prompt: 2,113)
   Turn 2: 2,244 tokens (prompt: 2,197)
   Turn 3: 2,324 tokens (prompt: 2,264)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
   TOTAL (3 turns): 6,747 tokens

⚠️  Problem Analysis:
   Tool schema tokens:     7,309 (per turn)
   Schema × 3 turns:       ~21,927 tokens
   Actual total:           6,747 tokens
   Schema overhead:        ~325.0%
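
The overhead figure above exceeds 100% because it divides an `estimate_tokens` figure for the pretty-printed schema JSON (7,309 tokens) by token counts reported by the API. A more conservative reading uses only the reported numbers: the turn-1 prompt (2,113 tokens) contains the system message, the first query, and all tool schemas, so it is an upper bound on the per-turn schema cost. The arithmetic below is a sketch built from the figures printed above; the variable names are introduced here for illustration.

```python
# Figures copied from the run above
prompt_tokens = [2_113, 2_197, 2_264]
total_tokens = [2_179, 2_244, 2_324]
estimated_schema_tokens = 7_309  # estimate_tokens() on indent=2 JSON

grand_total = sum(total_tokens)
# Turn 1 holds the system message, the first query, and every tool schema,
# so its prompt size is an upper bound on the schema cost paid each turn.
fixed_upper_bound = prompt_tokens[0] * len(prompt_tokens)

print(f"Reported total:               {grand_total:,}")
print(f"Estimate-based overhead:      {estimated_schema_tokens * 3 / grand_total:.1%}")
print(f"Upper bound from turn 1 (x3): {fixed_upper_bound:,} "
      f"({fixed_upper_bound / grand_total:.1%} of total)")
```

Either way the conclusion holds: the part of the prompt that repeats unchanged on every turn accounts for most of the tokens in this short conversation.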


### 3.2 [Best Practice] Tool Filtering + History Limit

Instead of loading ALL tools from a large MCP server, we can:
1. **Filter tools on-demand** - Only load tools relevant to the current task
2. **Limit conversation history** - Keep only the last N turns to reduce context size

This approach reduces the tokens sent on each turn while keeping the tools needed for the current task available.

| Optimization | Anti-Pattern (3.1) | Best Practice (3.2) |
|--------------|-------------|-------------|
| Tools loaded | ALL (50+) | Filtered (5-10) |
| History kept | ALL turns | Last 3 turns |
| Schema tokens | ~15,000/turn | ~2,000/turn |
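
One detail worth handling when limiting history: in the Chat Completions message format, a `tool` message must follow the assistant message that carries the matching `tool_calls`, so trimming by raw message count can leave an orphaned tool result. The sketch below trims on user-turn boundaries instead. `trim_to_last_turns` is a hypothetical helper written for this illustration, not part of the notebook or of any SDK.

```python
def trim_to_last_turns(history: list[dict], max_turns: int = 3) -> list[dict]:
    """Keep system messages plus the last `max_turns` user-initiated turns."""
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")

    system = [m for m in history if m.get("role") == "system"]
    rest = [m for m in history if m.get("role") != "system"]

    # Each user message opens a turn; cutting only at these indices keeps
    # assistant tool_calls and their tool results together.
    turn_starts = [i for i, m in enumerate(rest) if m.get("role") == "user"]
    if len(turn_starts) <= max_turns:
        return system + rest
    return system + rest[turn_starts[-max_turns]:]


# Build a 4-turn history in which every turn includes a tool call
history = [{"role": "system", "content": "You are an Azure assistant."}]
for n in range(1, 5):
    history += [
        {"role": "user", "content": f"query {n}"},
        {"role": "assistant", "content": None, "tool_calls": [
            {"id": f"call_{n}", "type": "function",
             "function": {"name": "azure_vm_list", "arguments": "{}"}}]},
        {"role": "tool", "tool_call_id": f"call_{n}", "content": "[result]"},
    ]

trimmed = trim_to_last_turns(history, max_turns=2)
print([m["role"] for m in trimmed])
```

Cutting at user messages means a turn is either kept whole or dropped whole, at the price of a history whose length varies with how many tool calls each turn made.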

In [14]:
# Example 3.2: Optimized Approach - Tool Filtering + History Limit
# This demonstrates how to reduce context by filtering tools and limiting history

from agent_framework.azure import AzureAIClient
from azure.ai.projects.models import CodeInterpreterTool

def filter_tools_by_intent(all_tools: list[dict], user_query: str) -> list[dict]:
    """
    Filter tools based on user intent detected from the query.
    
    In production, this could use:
    - Keyword matching
    - Embedding similarity
    - LLM-based classification
    """
    query_lower = user_query.lower()
    
    # Define tool categories and their keywords
    tool_categories = {
        "storage": ["storage", "blob", "container", "upload", "download", "file"],
        "compute": ["vm", "virtual machine", "start", "stop", "scale", "compute"],
        "security": ["keyvault", "secret", "key", "certificate", "vault", "password"],
        "database": ["cosmos", "sql", "database", "query", "table"],
        "monitoring": ["monitor", "log", "metric", "alert", "diagnostic"],
    }
    
    # Detect relevant categories
    relevant_categories = set()
    for category, keywords in tool_categories.items():
        if any(kw in query_lower for kw in keywords):
            relevant_categories.add(category)
    
    # If no category detected, return a minimal default set
    if not relevant_categories:
        relevant_categories = {"storage"}  # Default fallback
    
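    # Filter tools by category.
    # NOTE: this checks whether the category *name* (e.g. "storage") appears in the
    # tool name. Tools whose names do not embed the category name are dropped,
    # which is why a query can end up with 0 tools.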
    filtered_tools = []
    for tool in all_tools:
        tool_name = tool.get("function", {}).get("name", "")
        for category in relevant_categories:
            if category in tool_name:
                filtered_tools.append(tool)
                break
    
    return filtered_tools


def limit_conversation_history(history: list[dict], max_turns: int = 3) -> list[dict]:
    """
    Keep only the last N turns of conversation history.
    
    A "turn" consists of a user message and its corresponding assistant response.
    System message is always preserved.
    """
    # Always keep the system message
    system_messages = [m for m in history if m.get("role") == "system"]
    other_messages = [m for m in history if m.get("role") != "system"]
    
    # Approximate a turn as up to 3 messages (user + assistant + optional tool response).
    # This is a heuristic: a cut can land between a tool call and its tool response.
    max_messages = max_turns * 3
    
    # Keep only the last max_messages (slicing already handles shorter lists)
    trimmed_messages = other_messages[-max_messages:]
    
    return system_messages + trimmed_messages


def optimized_tool_filtering_example():
    """
    Optimized approach: Filter tools by intent and limit history.
    
    This demonstrates:
    - Loading only relevant tools (5-10 instead of 50+)
    - Keeping only last 3 turns of history
    - Significant token savings
    """
    print("=" * 70)
    print("🚀 3.2 [Best Practice] Tool Filtering + History Limit")
    print("=" * 70)
    
    # Get the full tool set from Example 3.1
    azure_mcp_tools = simulate_large_mcp_schema()
    full_tools_tokens = estimate_tokens(json.dumps(azure_mcp_tools, indent=2))
    
    print(f"\n📊 Full MCP Server: {len(azure_mcp_tools)} tools ({full_tools_tokens:,} tokens)")
    
    # Create Azure OpenAI client
    client = AzureOpenAI(
        azure_endpoint=AZURE_OPENAI_ENDPOINT,
        api_key=AZURE_OPENAI_API_KEY,
        api_version=AZURE_OPENAI_API_VERSION,
    )
    
    # Same 3-turn conversation as Example 3.1
    base_system_message = {"role": "system", "content": "You are an Azure infrastructure assistant. Help users manage their Azure resources."}
    conversation_history = [base_system_message.copy()]
    
    user_queries = [
        "List all my storage containers in the 'production-rg' resource group.",
        "Now show me all VMs in the same resource group.",
        "Finally, get the connection string secret from my 'prod-keyvault'.",
    ]
    
    total_tokens_optimized = 0
    turn_details = []
    
    for turn_num, query in enumerate(user_queries, 1):
        print(f"\n🔄 Turn {turn_num}: {query[:50]}...")
        
        # 1. Filter tools based on user intent
        filtered_tools = filter_tools_by_intent(azure_mcp_tools, query)
        filtered_tokens = estimate_tokens(json.dumps(filtered_tools, indent=2))
        print(f"   🔍 Filtered: {len(filtered_tools)} tools ({filtered_tokens:,} tokens)")
        
        # 2. Limit conversation history
        limited_history = limit_conversation_history(conversation_history, max_turns=3)
        
        # Add current user message
        limited_history.append({"role": "user", "content": query})
        
        # Make API call with FILTERED tools and LIMITED history
        response = client.chat.completions.create(
            model=AZURE_OPENAI_CHAT_DEPLOYMENT_NAME,
            messages=limited_history,
            tools=filtered_tools,
            tool_choice="auto",
            max_tokens=300,
        )
        
        usage = response.usage
        total_tokens_optimized += usage.total_tokens
        
        # Update full history for next turn
        conversation_history.append({"role": "user", "content": query})
        assistant_message = response.choices[0].message
        
        if assistant_message.tool_calls:
            tool_name = assistant_message.tool_calls[0].function.name
            conversation_history.append({"role": "assistant", "content": None, "tool_calls": [
                {"id": f"call_{turn_num}", "type": "function", "function": {"name": tool_name, "arguments": "{}"}}
            ]})
            conversation_history.append({
                "role": "tool",
                "tool_call_id": f"call_{turn_num}",
                "content": f"[Simulated response from {tool_name}]"
            })
        else:
            conversation_history.append({"role": "assistant", "content": assistant_message.content or ""})
        
        turn_details.append({
            "turn": turn_num,
            "tools_used": len(filtered_tools),
            "prompt_tokens": usage.prompt_tokens,
            "completion_tokens": usage.completion_tokens,
            "total_tokens": usage.total_tokens,
        })
        
        print(f"   ✅ Tokens used: {usage.total_tokens:,} (prompt: {usage.prompt_tokens:,})")
    
    # Summary and comparison
    print("\n" + "=" * 70)
    print("🎟️  Token Usage Summary (Optimized - Filtered Tools + Limited History)")
    print("=" * 70)
    
    for detail in turn_details:
        print(f"   Turn {detail['turn']}: {detail['total_tokens']:,} tokens ({detail['tools_used']} tools)")
    
    print(f"   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━")
    print(f"   TOTAL (3 turns): {total_tokens_optimized:,} tokens")
    
    # Compare with traditional approach from Example 3.1
    print(f"\n📊 Comparison: Traditional (3.1) vs Optimized (3.2)")
    print(f"   Traditional (all tools):    {total_tokens_schema_bloat:,} tokens")
    print(f"   Optimized (filtered):       {total_tokens_optimized:,} tokens")
    print(f"   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━")
    savings = ((total_tokens_schema_bloat - total_tokens_optimized) / total_tokens_schema_bloat) * 100
    print(f"   💰 Token Savings:           {savings:.1f}%")
    
    return total_tokens_optimized

# Run the example
total_tokens_filtered = optimized_tool_filtering_example()

🚀 3.2 [Best Practice] Tool Filtering + History Limit

📊 Full MCP Server: 53 tools (7,309 tokens)

🔄 Turn 1: List all my storage containers in the 'production-...
   🔍 Filtered: 2 tools (513 tokens)
   ✅ Tokens used: 295 (prompt: 243)

🔄 Turn 2: Now show me all VMs in the same resource group....
   🔍 Filtered: 0 tools (1 tokens)
   ✅ Tokens used: 244 (prompt: 115)

🔄 Turn 3: Finally, get the connection string secret from my ...
   🔍 Filtered: 0 tools (1 tokens)
   ✅ Tokens used: 339 (prompt: 262)

🎟️  Token Usage Summary (Optimized - Filtered Tools + Limited History)
   Turn 1: 295 tokens (2 tools)
   Turn 2: 244 tokens (0 tools)
   Turn 3: 339 tokens (0 tools)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
   TOTAL (3 turns): 878 tokens

📊 Comparison: Traditional (3.1) vs Optimized (3.2)
   Traditional (all tools):    6,747 tokens
   Optimized (filtered):       878 tokens
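
In the run above, turns 2 and 3 were left with 0 tools. `filter_tools_by_intent` keeps a tool only when the category name itself (for example `compute` or `security`) appears in the tool name, so the most likely explanation is that the simulated tool names do not contain those words. A possible alternative is to match each category's keywords against the tool name and description. The sketch below uses made-up tool definitions and a hypothetical `filter_tools_by_keywords` helper; the real output of `simulate_large_mcp_schema()` may be named differently.

```python
# Keyword lists are a shortened, illustrative subset of the notebook's categories.
TOOL_CATEGORIES = {
    "storage": ["storage", "blob", "container"],
    "compute": ["vm", "virtual machine", "compute"],
    "security": ["keyvault", "secret", "certificate"],
}


def _tool_text(tool: dict) -> str:
    """Lower-cased name and description of a tool in OpenAI function format."""
    fn = tool.get("function", {})
    return f"{fn.get('name', '')} {fn.get('description', '')}".lower()


def filter_tools_by_keywords(all_tools: list[dict], user_query: str) -> list[dict]:
    """Keep tools whose name or description mentions a keyword of a matched category."""
    query = user_query.lower()
    matched = [kws for kws in TOOL_CATEGORIES.values() if any(kw in query for kw in kws)]
    if not matched:
        return []  # the caller decides on a fallback set
    wanted = {kw for kws in matched for kw in kws}
    return [t for t in all_tools if any(kw in _tool_text(t) for kw in wanted)]


# Hypothetical tool definitions, used only for this illustration.
sample_tools = [
    {"type": "function", "function": {"name": "azure_storage_list_containers",
                                      "description": "List blob containers"}},
    {"type": "function", "function": {"name": "azure_vm_list",
                                      "description": "List virtual machines"}},
    {"type": "function", "function": {"name": "azure_keyvault_get_secret",
                                      "description": "Read a secret"}},
]

for q in [
    "List all my storage containers in the 'production-rg' resource group.",
    "Now show me all VMs in the same resource group.",
    "Finally, get the connection string secret from my 'prod-keyvault'.",
]:
    names = [t["function"]["name"] for t in filter_tools_by_keywords(sample_tools, q)]
    print(f"{q[:40]!r} -> {names}")
```

Plain substring matching is still crude (short keywords can match unrelated words), so embedding similarity or an LLM classifier, as the docstring of `filter_tools_by_intent` suggests, may be a better fit for production use.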

---

### Section 4: Context Compression

This section compares **uncontrolled context growth** (anti-pattern) vs **intelligent context compression** (best practice).

### 4.1 [Anti-Pattern] Inappropriate Large Context

As conversations grow longer, the context window fills up with **full conversation history**. This leads to:

1. **Steady token growth** - Each turn adds more tokens, so the context grows linearly and the cumulative tokens processed grow roughly quadratically
2. **Irrelevant old context** - Early messages may no longer be relevant
3. **Wasted capacity** - Less room for actual work
4. **Higher costs** - More tokens = higher API costs

**Example of Uncontrolled Growth:**
```
Turn 1: System (500) + User (100) + Assistant (200) = 800 tokens
Turn 2: Previous (800) + User (100) + Assistant (200) = 1,100 tokens
Turn 3: Previous (1,100) + User (100) + Assistant (200) = 1,400 tokens
...
Turn 10: Previous (3,200) + User (100) + Assistant (200) = 3,500 tokens
```
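
The arithmetic above can be checked with a few lines of Python. This sketch assumes the same illustrative sizes (a 500-token system prompt, 100-token user messages, 200-token assistant replies); the function name is hypothetical. It shows that the context after each turn grows linearly, while the total processed over the whole conversation grows much faster because every earlier turn is resent.

```python
def context_after_turn(turn: int, system: int = 500, user: int = 100, assistant: int = 200) -> int:
    """Context size after `turn` completed turns when the full history is kept."""
    return system + turn * (user + assistant)


sizes = [context_after_turn(t) for t in range(1, 11)]
cumulative = sum(sizes)  # tokens processed across all 10 requests

for t, size in enumerate(sizes, 1):
    print(f"Turn {t:>2}: {size:,} tokens")
print(f"Total across 10 turns: {cumulative:,} tokens")
```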

In [15]:
# 4.1 [Anti-Pattern]: Inappropriate Large Context - Uncontrolled Growth
# This demonstrates how context grows without proper management

from openai import AzureOpenAI

def demonstrate_context_growth():
    """
    Demonstrate the problem of uncontrolled context growth.
    
    This simulates a multi-turn conversation where:
    - Full history is kept for every turn
    - Context grows linearly with each turn
    - No optimization is applied
    """
    print("=" * 70)
    print("⚠️  4.1 [Anti-Pattern] Inappropriate Large Context")
    print("=" * 70)
    
    # Create Azure OpenAI client
    client = AzureOpenAI(
        azure_endpoint=AZURE_OPENAI_ENDPOINT,
        api_key=AZURE_OPENAI_API_KEY,
        api_version=AZURE_OPENAI_API_VERSION,
    )
    
    # Start with a detailed system message
    system_message = """You are a comprehensive Azure infrastructure assistant. 
You help users with all aspects of Azure resource management including:
- Virtual Machines (VMs): creation, sizing, networking, disk management
- Storage Accounts: blob storage, file shares, queues, tables
- Networking: VNets, subnets, NSGs, load balancers, VPN gateways
- Databases: Azure SQL, Cosmos DB, PostgreSQL, MySQL
- Security: Key Vault, Managed Identities, RBAC, Azure AD
- Monitoring: Azure Monitor, Log Analytics, Application Insights
- DevOps: Azure DevOps, GitHub Actions, Container Registry, AKS

Always provide detailed explanations with examples and best practices."""
    
    conversation_history = [{"role": "system", "content": system_message}]
    
    # Simulate a multi-turn conversation
    user_queries = [
        "I need to set up a new web application. What Azure resources should I use?",
        "Tell me more about the VM sizing options and which one fits my needs.",
        "How do I configure the networking for high availability?",
        "What about security best practices for the storage account?",
        "Can you explain the monitoring setup in more detail?",
    ]
    
    # Detailed responses to simulate realistic conversation
    simulated_responses = [
        "For a new web application, I recommend using Azure App Service for the frontend, Azure SQL Database for persistent storage, Azure Blob Storage for static files, and Azure CDN for content delivery. You should also consider Azure Key Vault for secrets management and Application Insights for monitoring...",
        "For VM sizing, Azure offers various series: B-series for burstable workloads, D-series for general purpose, E-series for memory-optimized, and F-series for compute-optimized. Based on your web application needs, I'd recommend starting with D2s_v5 (2 vCPU, 8GB RAM) and scaling as needed...",
        "For high availability networking, you should deploy across multiple Availability Zones, use Azure Load Balancer or Application Gateway for traffic distribution, implement Network Security Groups (NSGs) for access control, and consider Azure Front Door for global load balancing...",
        "Storage security best practices include: enabling encryption at rest with customer-managed keys, using Private Endpoints instead of public access, implementing Azure AD authentication, configuring CORS policies, enabling soft delete for blob protection, and setting up access policies with SAS tokens...",
        "For monitoring, set up Application Insights for application-level telemetry, Azure Monitor for infrastructure metrics, Log Analytics workspace for centralized logging, create custom dashboards in Azure Portal, set up alerts for critical metrics, and use Azure Monitor Workbooks for reporting...",
    ]
    
    print(f"\n📝 Initial System Message: {len(system_message)} characters")
    print(f"   Tokens: ~{estimate_tokens(system_message):,}")
    
    total_tokens_anti_pattern = 0
    turn_details = []
    
    for turn_num, (query, sim_response) in enumerate(zip(user_queries, simulated_responses), 1):
        print(f"\n🔄 Turn {turn_num}: {query[:50]}...")
        
        # Add user message
        conversation_history.append({"role": "user", "content": query})
        
        # Calculate context size BEFORE API call
        context_json = json.dumps(conversation_history)
        context_tokens = estimate_tokens(context_json)
        
        # Make API call with FULL history (anti-pattern)
        response = client.chat.completions.create(
            model=AZURE_OPENAI_CHAT_DEPLOYMENT_NAME,
            messages=conversation_history,
            max_tokens=300,
        )
        
        usage = response.usage
        total_tokens_anti_pattern += usage.total_tokens
        
        # Add response to history (simulating realistic response)
        assistant_content = response.choices[0].message.content or sim_response
        conversation_history.append({"role": "assistant", "content": assistant_content})
        
        turn_details.append({
            "turn": turn_num,
            "context_tokens": context_tokens,
            "prompt_tokens": usage.prompt_tokens,
            "total_tokens": usage.total_tokens,
            "history_length": len(conversation_history),
        })
        
        print(f"   📊 Context size: {context_tokens:,} tokens")
        print(f"   ✅ API tokens: {usage.total_tokens:,} (prompt: {usage.prompt_tokens:,})")
    
    # Summary
    print("\n" + "=" * 70)
    print("🎟️  Token Usage Summary (Anti-Pattern - Full History Every Turn)")
    print("=" * 70)
    
    print(f"\n   Context Growth Over Time:")
    for detail in turn_details:
        bar = "█" * (detail['context_tokens'] // 200)
        print(f"   Turn {detail['turn']}: {detail['context_tokens']:,} tokens {bar}")
    
    print(f"\n   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━")
    print(f"   TOTAL API tokens (5 turns): {total_tokens_anti_pattern:,}")
    
    # Calculate growth rate
    first_turn = turn_details[0]['context_tokens']
    last_turn = turn_details[-1]['context_tokens']
    growth_pct = ((last_turn - first_turn) / first_turn) * 100
    
    print(f"\n⚠️  Problem Analysis:")
    print(f"   Context growth: {first_turn:,} → {last_turn:,} tokens ({growth_pct:.0f}% increase)")
    print(f"   Messages in history: {len(conversation_history)}")
    print(f"   Average tokens/turn: {total_tokens_anti_pattern // 5:,}")
    
    return total_tokens_anti_pattern, turn_details

# Run the anti-pattern example
total_tokens_anti_pattern, context_growth_details = demonstrate_context_growth()

⚠️  4.1 [Anti-Pattern] Inappropriate Large Context

📝 Initial System Message: 639 characters
   Tokens: ~140

🔄 Turn 1: I need to set up a new web application. What Azure...
   📊 Context size: 183 tokens
   ✅ API tokens: 470 (prompt: 170)

🔄 Turn 2: Tell me more about the VM sizing options and which...
   📊 Context size: 544 tokens
   ✅ API tokens: 793 (prompt: 493)

🔄 Turn 3: How do I configure the networking for high availab...
   📊 Context size: 915 tokens
   ✅ API tokens: 1,111 (prompt: 811)

🔄 Turn 4: What about security best practices for the storage...
   📊 Context size: 1,280 tokens
   ✅ API tokens: 1,429 (prompt: 1,129)

🔄 Turn 5: Can you explain the monitoring setup in more detai...
   📊 Context size: 1,660 tokens
   ✅ API tokens: 1,747 (prompt: 1,447)

🎟️  Token Usage Summary (Anti-Pattern - Full History Every Turn)

   Context Growth Over Time:
   Turn 1: 183 tokens 
   Turn 2: 544 tokens ██
   Turn 3: 915 tokens ████

### 4.2 [Best Practice] Context Compression

Context compression takes optimization further by **summarizing conversation history** before it is sent to the model. This approach:

1. **Compresses past turns** into concise summaries
2. **Preserves key information** while reducing tokens
3. **Uses an LLM** to generate the summaries, which itself costs tokens

| Turn | Anti-Pattern (4.1) | Best Practice (4.2) |
|------|---------------------|----------------------|
| 1 | Full message | Full message |
| 2 | Full message | Full message |
| 3+ | Full messages (growing) | Compressed summary |

In [16]:
# Example 4.2: Advanced Optimization - Context Compression
# This demonstrates compressing conversation history to reduce context size

import json

from openai import AzureOpenAI

def compress_conversation_history(
    client: AzureOpenAI,
    history: list[dict],
    keep_recent: int = 2
) -> list[dict]:
    """
    Compress older conversation history into a summary.
    
    Args:
        client: Azure OpenAI client for generating summaries
        history: Full conversation history
        keep_recent: Number of recent turns to keep uncompressed
    
    Returns:
        Compressed history with summary of older turns
    """
    # Separate system message
    system_messages = [m for m in history if m.get("role") == "system"]
    other_messages = [m for m in history if m.get("role") != "system"]
    
    # If history is short enough, no compression needed
    messages_per_turn = 3  # heuristic: user + assistant + optional tool response
    if len(other_messages) <= keep_recent * messages_per_turn:
        return history
    
    # Split into old (to compress) and recent (to keep)
    split_point = len(other_messages) - (keep_recent * messages_per_turn)
    old_messages = other_messages[:split_point]
    recent_messages = other_messages[split_point:]
    
    # Generate summary of old messages
    old_text = ""
    for msg in old_messages:
        role = msg.get("role", "unknown")
        content = msg.get("content", "")
        if content:
            old_text += f"{role}: {content}\n"
    
    # Use LLM to compress
    compression_response = client.chat.completions.create(
        model=AZURE_OPENAI_CHAT_DEPLOYMENT_NAME,
        messages=[
            {"role": "system", "content": "Summarize the following conversation in 2-3 concise sentences, preserving key facts and context."},
            {"role": "user", "content": old_text}
        ],
        max_tokens=150,
    )
    
    summary = compression_response.choices[0].message.content
    
    # Create compressed history
    compressed_history = system_messages.copy()
    compressed_history.append({
        "role": "system",
        "content": f"[Previous conversation summary: {summary}]"
    })
    compressed_history.extend(recent_messages)
    
    return compressed_history


def context_compression_example():
    """
    Advanced optimization: Compress conversation history.
    
    This demonstrates:
    - Summarizing older conversation turns
    - Keeping recent context intact
    - Combining with tool filtering for maximum savings
    """
    print("=" * 70)
    print("🗜️  4.2 [Best Practice] Context Compression")
    print("=" * 70)
    
    # Get the full tool set; it is filtered per turn below (3.2 approach)
    azure_mcp_tools = simulate_large_mcp_schema()
    
    # Create Azure OpenAI client
    client = AzureOpenAI(
        azure_endpoint=AZURE_OPENAI_ENDPOINT,
        api_key=AZURE_OPENAI_API_KEY,
        api_version=AZURE_OPENAI_API_VERSION,
    )
    
    # Simulate a longer 5-turn conversation
    base_system_message = {"role": "system", "content": "You are an Azure infrastructure assistant."}
    conversation_history = [base_system_message.copy()]
    
    user_queries = [
        "List all my storage containers in 'production-rg'.",
        "Upload a test file to the 'data' container.",
        "Now show me all VMs in the resource group.",
        "Start the VM named 'web-server-01'.",
        "Finally, get the database connection string from 'prod-keyvault'.",
    ]
    
    total_tokens_compressed = 0
    # NOTE: tokens spent on the summarization calls are not counted in this example
    total_compression_tokens = 0
    turn_details = []
    
    for turn_num, query in enumerate(user_queries, 1):
        print(f"\n🔄 Turn {turn_num}: {query[:50]}...")
        
        # 1. Filter tools
        filtered_tools = filter_tools_by_intent(azure_mcp_tools, query)
        
        # 2. Compress history (keep last 2 turns uncompressed)
        if turn_num > 2:
            original_history_tokens = estimate_tokens(json.dumps(conversation_history))
            compressed_history = compress_conversation_history(client, conversation_history, keep_recent=2)
            compressed_history_tokens = estimate_tokens(json.dumps(compressed_history))
            compression_savings = original_history_tokens - compressed_history_tokens
            print(f"   🗜️  History compressed: {original_history_tokens:,} → {compressed_history_tokens:,} tokens (saved {compression_savings:,})")
        else:
            compressed_history = conversation_history.copy()
        
        # Add current query
        compressed_history.append({"role": "user", "content": query})
        
        # Make API call
        response = client.chat.completions.create(
            model=AZURE_OPENAI_CHAT_DEPLOYMENT_NAME,
            messages=compressed_history,
            tools=filtered_tools,
            tool_choice="auto",
            max_tokens=300,
        )
        
        usage = response.usage
        total_tokens_compressed += usage.total_tokens
        
        # Update full history (uncompressed, for reference)
        conversation_history.append({"role": "user", "content": query})
        assistant_message = response.choices[0].message
        
        if assistant_message.tool_calls:
            tool_name = assistant_message.tool_calls[0].function.name
            conversation_history.append({"role": "assistant", "content": None, "tool_calls": [
                {"id": f"call_{turn_num}", "type": "function", "function": {"name": tool_name, "arguments": "{}"}}
            ]})
            conversation_history.append({
                "role": "tool",
                "tool_call_id": f"call_{turn_num}",
                "content": f"[Simulated response from {tool_name}]"
            })
        else:
            conversation_history.append({"role": "assistant", "content": assistant_message.content or ""})
        
        turn_details.append({
            "turn": turn_num,
            "tools_used": len(filtered_tools),
            "total_tokens": usage.total_tokens,
        })
        
        print(f"   ✅ Tokens used: {usage.total_tokens:,} ({len(filtered_tools)} tools)")
    
    # Summary
    print("\n" + "=" * 70)
    print("🎟️  Token Usage Summary (Compressed + Filtered)")
    print("=" * 70)
    
    for detail in turn_details:
        print(f"   Turn {detail['turn']}: {detail['total_tokens']:,} tokens ({detail['tools_used']} tools)")
    
    print(f"   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━")
    print(f"   TOTAL (5 turns): {total_tokens_compressed:,} tokens")
    
    # Final comparison
    print("\n" + "=" * 70)
    print("📊 Final Comparison: All Optimization Approaches")
    print("=" * 70)
    
    # Estimate what traditional would cost for 5 turns
    estimated_traditional_5turns = int(total_tokens_schema_bloat * 5 / 3)  # Extrapolate from 3-turn data
    
    print(f"   Traditional (3.1, 3 turns):      {total_tokens_schema_bloat:,} tokens")
    print(f"   Filtered (3.2, 3 turns):         {total_tokens_filtered:,} tokens")
    print(f"   Compressed (4.2, 5 turns):       {total_tokens_compressed:,} tokens")
    print(f"   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━")
    
    # Calculate savings vs traditional (normalized per turn)
    trad_per_turn = total_tokens_schema_bloat / 3
    compressed_per_turn = total_tokens_compressed / 5
    savings_vs_trad = ((trad_per_turn - compressed_per_turn) / trad_per_turn) * 100
    
    print(f"\n   Per-turn average:")
    print(f"   Traditional:    {trad_per_turn:,.0f} tokens/turn")
    print(f"   Compressed:     {compressed_per_turn:,.0f} tokens/turn")
    print(f"   💰 Savings:     {savings_vs_trad:.1f}% per turn")
    
    return total_tokens_compressed

# Run the example
total_tokens_compressed = context_compression_example()

🗜️  4.2 [Best Practice] Context Compression

🔄 Turn 1: List all my storage containers in 'production-rg'....
   ✅ Tokens used: 283 (2 tools)

🔄 Turn 2: Upload a test file to the 'data' container....
   ✅ Tokens used: 390 (2 tools)

🔄 Turn 3: Now show me all VMs in the resource group....
   🗜️  History compressed: 230 → 230 tokens (saved 0)
   ✅ Tokens used: 290 (0 tools)

🔄 Turn 4: Start the VM named 'web-server-01'....
   🗜️  History compressed: 370 → 383 tokens (saved -13)
   ✅ Tokens used: 394 (0 tools)

🔄 Turn 5: Finally, get the database connection string from '...
   🗜️  History compressed: 464 → 423 tokens (saved 41)
   ✅ Tokens used: 467 (0 tools)

🎟️  Token Usage Summary (Compressed + Filtered)
   Turn 1: 283 tokens (2 tools)
   Turn 2: 390 tokens (2 tools)
   Turn 3: 290 tokens (0 tools)
   Turn 4: 394 tokens (0 tools)
   Turn 5: 467 tokens (0 tools)
   ━━━━━━━━━━━━━━━━━━━━━━
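
The run above shows that summarizing a short history does not always pay off: the reported saving was 0 tokens on turn 3 and negative on turn 4. A simple guard is to keep the compressed version only when it is actually smaller. The sketch below assumes a rough four-characters-per-token estimate and a stand-in summarizer; `rough_tokens`, `compress_if_smaller`, and `fake_summarizer` are hypothetical names, and a real implementation would call the model and would also count the tokens spent on the summary request.

```python
import json
from typing import Callable


def rough_tokens(messages: list[dict]) -> int:
    """Very rough size estimate: about 4 characters per token (an assumption)."""
    return len(json.dumps(messages)) // 4


def compress_if_smaller(
    history: list[dict],
    keep_recent_messages: int,
    summarize: Callable[[str], str],
) -> list[dict]:
    """Summarize older messages, but return the original history if that is not smaller."""
    system = [m for m in history if m.get("role") == "system"]
    others = [m for m in history if m.get("role") != "system"]
    if keep_recent_messages <= 0 or len(others) <= keep_recent_messages:
        return history

    old, recent = others[:-keep_recent_messages], others[-keep_recent_messages:]
    old_text = "\n".join(f"{m['role']}: {m['content']}" for m in old if m.get("content"))
    summary = {
        "role": "system",
        "content": f"[Previous conversation summary: {summarize(old_text)}]",
    }
    candidate = system + [summary] + recent

    # Keep the summary only when it shrinks the context.
    return candidate if rough_tokens(candidate) < rough_tokens(history) else history


def fake_summarizer(text: str) -> str:
    # Stand-in for an LLM call; returns a fixed summary.
    return "User asked about storage and VMs."


short_history = [
    {"role": "system", "content": "You are an assistant."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
    {"role": "user", "content": "List VMs"},
]
long_history = short_history[:1] + [
    {"role": "user" if i % 2 == 0 else "assistant", "content": "details " * 40}
    for i in range(6)
]

short_result = compress_if_smaller(short_history, 2, fake_summarizer)
long_result = compress_if_smaller(long_history, 2, fake_summarizer)
print("short history compressed:", short_result is not short_history)
print("long history compressed:", long_result is not long_history)
```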

## Best Practices Summary

### When to Use Each Optimization

| Optimization | Use When | Avoid When |
|--------------|----------|------------|
| **Code Execution** | Large datasets (>1000 rows), multi-step workflows | Simple queries, small data |
| **Tool Filtering** | Many tools available (10+), varied user intents | Few tools, consistent tool usage |
| **History Limiting** | Long conversations (5+ turns), repetitive queries | Context-dependent tasks |
| **Context Compression** | Very long sessions, returning users | Short interactions |
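
The table can be read as a set of simple thresholds. The sketch below encodes them in a hypothetical `recommend_optimizations` helper; the cut-offs (more than 1,000 rows, 10 or more tools, 5 or more turns) come from the table, while the `very_long_session_turns` default of 20 is an assumption chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    dataset_rows: int = 0
    tool_count: int = 0
    conversation_turns: int = 0
    context_dependent: bool = False  # later turns rely on details from early turns


def recommend_optimizations(w: Workload, very_long_session_turns: int = 20) -> list[str]:
    """Map workload characteristics to the optimizations suggested in the table."""
    recommendations = []
    if w.dataset_rows > 1000:
        recommendations.append("code_execution")
    if w.tool_count >= 10:
        recommendations.append("tool_filtering")
    if w.conversation_turns >= 5 and not w.context_dependent:
        recommendations.append("history_limiting")
    if w.conversation_turns >= very_long_session_turns:
        recommendations.append("context_compression")
    return recommendations


batch_job = Workload(dataset_rows=50_000, tool_count=53, conversation_turns=3)
long_chat = Workload(tool_count=4, conversation_turns=25, context_dependent=True)
print(recommend_optimizations(batch_job))
print(recommend_optimizations(long_chat))
```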

### Security Considerations for Code Execution

When implementing code execution with MCP, pay careful attention to security:

| Security Layer | Implementation |
|---------------|----------------|
| **Sandboxing** | Run code in isolated containers (e.g., Docker with limited permissions) |
| **Resource Limits** | Set memory, CPU, time, and disk limits |
| **Code Validation** | Check for dangerous patterns before execution |
| **Network Isolation** | Disable or restrict network access in sandbox |
| **Monitoring** | Log all executions and alert on suspicious patterns |
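
Two of these layers, code validation and a time limit, can be illustrated with the standard library alone. This is a minimal sketch, not a sandbox: `validate_code` and `run_with_timeout` are hypothetical helpers, the blocked-name lists are examples rather than a complete policy, and a static check like this is easy to bypass, so it should complement container isolation rather than replace it.

```python
import ast
import subprocess
import sys

# Example deny-lists only; a real policy would be stricter and more complete.
BLOCKED_MODULES = {"os", "subprocess", "socket", "shutil"}
BLOCKED_CALLS = {"eval", "exec", "__import__", "open"}


def validate_code(source: str) -> list[str]:
    """Return a list of policy violations found by a static AST scan."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            names = []
        problems += [f"blocked import: {n}" for n in names if n in BLOCKED_MODULES]
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BLOCKED_CALLS:
                problems.append(f"blocked call: {node.func.id}")
    return problems


def run_with_timeout(source: str, timeout_s: float = 5.0) -> str:
    """Validate, then run the code in a separate interpreter with a time limit."""
    problems = validate_code(source)
    if problems:
        raise ValueError("; ".join(problems))
    result = subprocess.run(
        [sys.executable, "-I", "-c", source],  # -I: isolated mode
        capture_output=True,
        text=True,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired when exceeded
    )
    return result.stdout


print(validate_code("import os\nos.remove('x')"))
print(run_with_timeout("print(sum(range(10)))"))
```

Memory, CPU, disk, and network limits are not covered here; those are better enforced by the container runtime or the operating system.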

## Wrap-up

### Key Takeaways

This notebook demonstrated **Anti-Pattern vs Best Practice** pairs for context optimization. The savings below are approximate and depend on the workload:

| Section | Anti-Pattern | Best Practice | Token Savings |
|---------|--------------|---------------|---------------|
| **1. Single Code Execution** | Full data in context | Data in sandbox | ~90% |
| **2. Multi-Step Workflows** | 4 round trips | Single pipeline | ~95% |
| **3. MCP Optimization** | All tools loaded | Filtered by intent | ~60-70% |
| **4. Context Compression** | Full history | Compressed summary | ~70-80% |

### Examples Reference

| Section | Example | Approach | Key Technique |
|---------|---------|----------|---------------|
| Part 1 | 1.1-1.2 | Warming Up | MCP connection basics |
| Part 1 | 1.3 | Warming Up | Context window understanding |
| Part 2.1 | 1.1 [Anti] | Traditional | Full data in context |
| Part 2.1 | 1.2 [Best] | Code Exec | Data generation in sandbox |
| Part 2.2 | 2.1 [Anti] | Traditional | 4 round trips |
| Part 2.2 | 2.2 [Best] | Code Exec | Single pipeline execution |
| Part 2.3 | 3.1 [Anti] | Traditional | MCP schema bloat |
| Part 2.3 | 3.2 [Best] | Optimized | Tool filtering + history limit |
| Part 2.4 | 4.1 [Anti] | Traditional | Uncontrolled context growth |
| Part 2.4 | 4.2 [Best] | Optimized | Context compression |

### Summary: Context Optimization Strategies

1. **Code Execution**: Process data in sandbox, return only summaries
2. **Tool Filtering**: Load only relevant tools based on user intent
3. **History Limiting**: Keep only recent N turns of conversation
4. **Context Compression**: Summarize older conversation turns with an LLM

## Additional Resources

- [Model Context Protocol Documentation](https://modelcontextprotocol.io/)
- [Microsoft Agent Framework - Using MCP Tools](https://learn.microsoft.com/en-us/agent-framework/user-guide/model-context-protocol/using-mcp-tools)
- [Code Execution with MCP - Anthropic Engineering](https://www.anthropic.com/engineering/code-execution-with-mcp)
- [MCP GitHub Repository](https://github.com/modelcontextprotocol)