# Azure Monitor for AI Agent Service - Failures & Alerts

This notebook helps on how to:
- Query diagnostic logs using KQL (Kusto Query Language)
- Monitor failures and errors in Azure AI Agent Service
- Create alerts for critical issues
- Build dashboards for operational insights

## Prerequisites
- Diagnostic settings enabled on your AI Foundry resource
- Log Analytics workspace configured
- Azure CLI or Azure SDK for Python

## Setup and Configuration

In [None]:
# Install required packages
%pip install azure-monitor-query azure-identity azure-mgmt-monitor azure-mgmt-loganalytics pandas matplotlib python-dotenv

In [87]:
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient, LogsQueryStatus
from datetime import datetime, timedelta
import pandas as pd
import json
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Configuration - Update these values
WORKSPACE_ID = "YOUR_WORKSPACE_ID"  # Replace with your workspace ID
WORKSPACE_NAME = "YOUR_WORKSPACE_NAME"  # Replace with your workspace name
SUBSCRIPTION_ID = "{SUBSCRIPTION_ID}"  # Replace with your subscription ID
RESOURCE_GROUP = "{RESOURCE_GROUP}"  # Replace with your resource group
AI_FOUNDRY_RESOURCE = "gk-agent-framework-project"  # Your AI Foundry resource name

# Initialize clients
credential = DefaultAzureCredential()
logs_client = LogsQueryClient(credential)

## KQL Queries for Failure Detection

### 0. Discover Available Columns (Run this first!)

In [49]:
# First, let's discover what columns are available in our logs
query_discover = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated > ago(24h)
| take 1
"""

def execute_query(query, timespan=timedelta(hours=24)):
    """Execute a KQL query against Log Analytics workspace"""
    try:
        response = logs_client.query_workspace(
            workspace_id=WORKSPACE_ID,
            query=query,
            timespan=timespan
        )
        
        if response.status == LogsQueryStatus.SUCCESS:
            data = response.tables
            if data and len(data) > 0:
                table = data[0]
                # Get column names - handle both string and object types
                if hasattr(table.columns[0], 'name'):
                    columns = [col.name for col in table.columns]
                else:
                    columns = table.columns
                df = pd.DataFrame(table.rows, columns=columns)
                return df
            else:
                print("No data returned from query")
                return None
        else:
            print(f"Query failed with status: {response.status}")
            return None
    except Exception as e:
        print(f"Error executing query: {e}")
        import traceback
        traceback.print_exc()
        return None

# Discover available columns
discover_df = execute_query(query_discover)
if discover_df is not None:
    print("Available columns in AzureDiagnostics:")
    print("\n".join(sorted(discover_df.columns.tolist())))
    print(f"\nTotal columns: {len(discover_df.columns)}")
    print("\nSample data:")
    display(discover_df.head())
else:
    print("Unable to query. Check your workspace ID and credentials.")

Available columns in AzureDiagnostics:
AdditionalFields
AssetIdentity_g
CallerIPAddress
Caller_s
Category
Computer
CorrelationId
DatabaseName_s
DurationMs
ElasticPoolName_s
EventName_s
JobId_g
Level
LogicalServerName_s
MG
ManagementGroupName
Message
OperationName
OperationVersion
RawData
Resource
ResourceGroup
ResourceId
ResourceProvider
ResourceType
ResultDescription
ResultSignature
ResultType
RunOn_s
RunbookName_s
SourceSystem
StreamType_s
SubscriptionId
TenantId
Tenant_g
Tenant_s
TimeGenerated
Type
_ResourceId
_schema_s
clientIP_s
clientInfo_s
clientPort_d
code_s
conditions_None_s
conditions_destinationIP_s
conditions_destinationPortRange_s
conditions_protocols_s
conditions_sourceIP_s
conditions_sourcePortRange_s
correlation_actionTrackingId_g
correlation_clientTrackingId_s
direction_s
endTime_t
event_s
host_s
httpMethod_s
httpStatusCode_d
httpStatus_d
httpVersion_s
id_s
identity_claim_appid_g
identity_claim_http_schemas_microsoft_com_claims_authnmethodsreferences_s
identity_claim_h

Unnamed: 0,TenantId,TimeGenerated,ResourceId,Category,ResourceGroup,SubscriptionId,ResourceProvider,Resource,ResourceType,OperationName,...,Computer,RawData,AssetIdentity_g,event_s,location_s,properties_s,Tenant_s,AdditionalFields,Type,_ResourceId
0,YOUR_WORKSPACE_ID,2025-10-07 13:02:16.354000+00:00,/SUBSCRIPTIONS/D7713F12-C2AF-4980-A889-AF28580...,RequestResponse,{RESOURCE_GROUP},{SUBSCRIPTION_ID},MICROSOFT.COGNITIVESERVICES,{AI_FOUNDRY_PROJECT},ACCOUNTS,Create_Run,...,,,,ShoeboxCallResult,eastus2,"{""apiName"":""Azure OpenAI API version 2025-03-0...",eastus2,,AzureDiagnostics,/subscriptions/d7713f12-c2af-4980-a889-af28580...


### Key Findings from Schema Discovery

From the schema above, we have these important columns available:
- **Status/Error tracking**: `ResultType`, `ResultSignature`, `ResultDescription`, `httpStatusCode_d`
- **Request tracking**: `CorrelationId`, `OperationName`, `DurationMs`
- **Client info**: `CallerIPAddress`, `clientIP_s`, `userAgent_s`
- **Resource info**: `Resource`, `ResourceGroup`, `ResourceProvider`
- **Extended data**: `properties_s` (JSON field with API details, tokens, etc.)


In [50]:
# Let's examine the properties_s field to see what additional data is available
query_properties = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated > ago(24h)
| where isnotempty(properties_s)
| take 1
| project TimeGenerated, OperationName, properties_s
"""

props_df = execute_query(query_properties)
if props_df is not None and len(props_df) > 0:
    print("Sample properties_s content (contains API details, tokens, etc.):")
    print("\nRaw JSON:")
    print(props_df['properties_s'].iloc[0])
    
    # Try to parse the JSON
    try:
        import json
        props_json = json.loads(props_df['properties_s'].iloc[0])
        print("\n\nParsed JSON keys:")
        print(json.dumps(props_json, indent=2))
    except:
        print("\nCouldn't parse as JSON, but raw data shown above")
else:
    print("No properties_s data found")

Sample properties_s content (contains API details, tokens, etc.):

Raw JSON:
{"apiName":"Azure OpenAI API version 2025-03-01-preview","requestTime":638954276225566653,"requestLength":0,"responseTime":638954276231009442,"responseLength":4010,"objectId":"be5a8606-f5be-4381-b0c8-2274fb41ec08"}


Parsed JSON keys:
{
  "apiName": "Azure OpenAI API version 2025-03-01-preview",
  "requestTime": 638954276225566653,
  "requestLength": 0,
  "responseTime": 638954276231009442,
  "responseLength": 4010,
  "objectId": "be5a8606-f5be-4381-b0c8-2274fb41ec08"
}


### Examine properties_s Variations Across Different Operations

In [51]:
# Let's examine properties_s structure for different operation types
query_operations_properties = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated > ago(24h)
| where isnotempty(properties_s)
| summarize 
    SampleProperties = any(properties_s),
    Count = count()
    by OperationName
| order by Count desc
| take 10
"""

ops_props_df = execute_query(query_operations_properties)
if ops_props_df is not None and len(ops_props_df) > 0:
    print("Properties structure by Operation Type:")
    print("=" * 100)
    
    for idx, row in ops_props_df.iterrows():
        print(f"\n{idx + 1}. Operation: {row['OperationName']}")
        print(f"   Count: {row['Count']}")
        print(f"   Sample Properties:")
        try:
            import json
            props_json = json.loads(row['SampleProperties'])
            print(f"   Keys: {', '.join(props_json.keys())}")
            print(f"   Sample: {json.dumps(props_json, indent=6)}")
        except:
            print(f"   Raw: {row['SampleProperties'][:200]}...")
        print("-" * 100)
else:
    print("No operations with properties found")
    
print("\n The properties_s field structure varies by operation type.")

Properties structure by Operation Type:

1. Operation: Embeddings_Create
   Count: 144
   Sample Properties:
   Keys: apiName, requestTime, requestLength, responseTime, responseLength, objectId, streamType, modelDeploymentName, modelName, modelVersion
   Sample: {
      "apiName": "Azure OpenAI API version 2025-01-01-preview",
      "requestTime": 638954397675903821,
      "requestLength": 147,
      "responseTime": 638954397676451904,
      "responseLength": 8416,
      "objectId": "",
      "streamType": "Non-Streaming",
      "modelDeploymentName": "text-embedding-3-small",
      "modelName": "text-embedding-3-small",
      "modelVersion": "1"
}
----------------------------------------------------------------------------------------------------

2. Operation: ChatCompletions_Create
   Count: 39
   Sample Properties:
   Keys: apiName, requestTime, requestLength, responseTime, responseLength, objectId, streamType, modelDeploymentName, modelName, modelVersion
   Sample: {
      "apiNam

### Extracting Token Usage from properties_s

Since token information may be in `properties_s`, here's how to extract it using KQL's `parse_json()` function:

In [54]:
# Extract and analyze ACTUAL available data from properties_s JSON field
# Based on schema discovery: requestLength, responseLength, modelDeploymentName, modelName, streamType, modelVersion
query_advanced_properties = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated > ago(24h)
| where isnotempty(properties_s)
| extend Properties = parse_json(properties_s)
| extend 
    ApiName = tostring(Properties.apiName),
    RequestLength = toint(Properties.requestLength),
    ResponseLength = toint(Properties.responseLength),
    ModelDeployment = tostring(Properties.modelDeploymentName),
    ModelName = tostring(Properties.modelName),
    StreamType = tostring(Properties.streamType),
    ModelVersion = tostring(Properties.modelVersion)
| summarize 
    RequestCount = count(),
    TotalRequestBytes = sum(RequestLength),
    TotalResponseBytes = sum(ResponseLength),
    AvgRequestBytes = avg(RequestLength),
    AvgResponseBytes = avg(ResponseLength),
    AvgDurationMs = avg(DurationMs),
    Models = make_set(ModelDeployment)
    by OperationName, ApiName, StreamType
| order by RequestCount desc
"""

advanced_props_df = execute_query(query_advanced_properties)
if advanced_props_df is not None and len(advanced_props_df) > 0:
    print(" Request & Response Analysis (from properties_s):")
    display(advanced_props_df)
    
    # Show summary
    print("\n Summary:")
    print(f"   Total Requests: {advanced_props_df['RequestCount'].sum():,}")
    total_req_mb = advanced_props_df['TotalRequestBytes'].sum() / (1024*1024)
    total_resp_mb = advanced_props_df['TotalResponseBytes'].sum() / (1024*1024)
    print(f"   Total Request Data: {total_req_mb:.2f} MB")
    print(f"   Total Response Data: {total_resp_mb:.2f} MB")
    
    # Convert AvgDurationMs to numeric, handling any non-numeric values
    if 'AvgDurationMs' in advanced_props_df.columns:
        avg_duration = advanced_props_df['AvgDurationMs'].apply(lambda x: float(x) if x != '' and x is not None else 0).mean()
        if avg_duration > 0:
            print(f"   Avg Duration: {avg_duration:.2f} ms")
    
    print(f"   Unique Operations: {len(advanced_props_df)}")
else:
    print("No advanced properties data available")

 Request & Response Analysis (from properties_s):


Unnamed: 0,OperationName,ApiName,StreamType,RequestCount,TotalRequestBytes,TotalResponseBytes,AvgRequestBytes,AvgResponseBytes,AvgDurationMs,Models
0,Embeddings_Create,Azure OpenAI API version 2025-01-01-preview,Non-Streaming,144,100130,1211848,695.347222,8415.611111,306.159722,"[""text-embedding-3-small""]"
1,ChatCompletions_Create,Azure OpenAI API version 2025-01-01-preview,Non-Streaming,39,443370,106981,11368.461538,2743.102564,4164.128205,"[""gpt-4.1""]"
2,Create_Message,Azure OpenAI API version 2025-03-01-preview,,36,121894,133195,3385.944444,3699.861111,242.694444,"[""""]"
3,Creates a new message on a specified thread.,Azure AI Projects API,,36,121894,0,3385.944444,0.0,356.555556,"[""""]"
4,Completions_Create,Azure OpenAI API version 2025-03-01-preview,Streaming,30,1368479,0,45615.966667,0.0,851.1,"[""gpt-4.1""]"
5,Create_Run,Azure OpenAI API version 2025-03-01-preview,,19,100867,0,5308.789474,0.0,414.052632,"[""""]"
6,Creates a new run for an assistant thread.,Azure AI Projects API,,19,89774,0,4724.947368,0.0,671.842105,"[""""]"
7,Creates a new thread. Threads contain messages...,Azure AI Projects API,,19,1182,3955,62.210526,208.157895,266.789474,"[""""]"
8,Create_Thread,Azure OpenAI API version 2025-03-01-preview,,19,6742,3955,354.842105,208.157895,138.157895,"[""""]"
9,Uploads a file for use by other operations.,Azure AI Projects API,,13,170062054,0,13081696.461538,0.0,8147.615385,"[""""]"



 Summary:
   Total Requests: 479
   Total Request Data: 326.69 MB
   Total Response Data: 1.45 MB
   Avg Duration: 1423.58 ms
   Unique Operations: 26


### 1. Query All Failures in Last 24 Hours

In [56]:
# Updated query with commonly available columns
query_all_failures = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated > ago(24h)
| where ResultType == "Failed" or ResultType == "Error" or toint(ResultSignature) >= 400
| project TimeGenerated, OperationName, ResultType, ResultSignature, ResultDescription,
          CorrelationId, CallerIPAddress, Resource
| order by TimeGenerated desc
"""

# Execute the query
failures_df = execute_query(query_all_failures)
if failures_df is not None:
    print(f"Found {len(failures_df)} failures in the last 24 hours")
    display(failures_df.head(10))
else:
    print("No failures found or unable to query")

Found 0 failures in the last 24 hours


Unnamed: 0,TimeGenerated,OperationName,ResultType,ResultSignature,ResultDescription,CorrelationId,CallerIPAddress,Resource


### 2. Query Rate Limit Errors (429 Status Codes)

In [57]:
# Updated query for rate limits - using ResultSignature instead of httpStatusCode_d
query_rate_limits = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where ResultSignature == "429" or ResultSignature == "TooManyRequests"
| where TimeGenerated > ago(24h)
| summarize Count=count(), 
            FirstOccurrence=min(TimeGenerated), 
            LastOccurrence=max(TimeGenerated) 
            by OperationName, Resource
| order by Count desc
"""

rate_limits_df = execute_query(query_rate_limits)
if rate_limits_df is not None and len(rate_limits_df) > 0:
    print("Rate Limit Errors by Operation:")
    display(rate_limits_df)
else:
    print("No rate limit errors found")

No rate limit errors found


### 3. Query Authentication Failures (401/403)

In [58]:
# Updated query for authentication failures
query_auth_failures = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where ResultSignature in ("401", "403", "Unauthorized", "Forbidden")
| where TimeGenerated > ago(24h)
| project TimeGenerated, OperationName, ResultSignature, ResultType, ResultDescription,
          CallerIPAddress, CorrelationId
| order by TimeGenerated desc
"""

auth_failures_df = execute_query(query_auth_failures)
if auth_failures_df is not None and len(auth_failures_df) > 0:
    print(f"Found {len(auth_failures_df)} authentication failures")
    display(auth_failures_df)
else:
    print("No authentication failures found")

No authentication failures found


### 4. Query Server Errors (5xx Status Codes)

In [59]:
# Updated query for server errors (5xx)
query_server_errors = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where ResultSignature startswith "5" or ResultType == "Failed" or ResultType == "Error"
| where TimeGenerated > ago(24h)
| summarize ErrorCount=count(), 
            LatestError=max(TimeGenerated),
            SampleDescriptions=make_set(ResultDescription, 5)
            by ResultSignature, OperationName
| order by ErrorCount desc
"""

server_errors_df = execute_query(query_server_errors)
if server_errors_df is not None and len(server_errors_df) > 0:
    print("Server Errors Summary:")
    display(server_errors_df)
else:
    print("No server errors found")

No server errors found


### 5. Query Agent Service Specific Errors

In [60]:
# Updated query for agent-specific errors
query_agent_errors = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where OperationName contains "Agent" or OperationName contains "Thread" or OperationName contains "Run"
| where ResultType == "Failed" or ResultType == "Error"
| where TimeGenerated > ago(24h)
| project TimeGenerated, OperationName, ResultType, ResultSignature, ResultDescription,
          DurationMs, CorrelationId
| order by TimeGenerated desc
"""

agent_errors_df = execute_query(query_agent_errors)
if agent_errors_df is not None and len(agent_errors_df) > 0:
    print(f"Found {len(agent_errors_df)} Agent Service errors")
    display(agent_errors_df.head(10))
else:
    print("No agent service errors found")

No agent service errors found


### 6. Query High Latency Operations (Performance Issues)

In [61]:
# Updated query for high latency operations
query_high_latency = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated > ago(24h)
| where DurationMs > 5000  // Operations taking more than 5 seconds
| project TimeGenerated, OperationName, DurationMs, 
          ResultSignature, ResultType, Resource, CorrelationId
| order by DurationMs desc
"""

high_latency_df = execute_query(query_high_latency)
if high_latency_df is not None and len(high_latency_df) > 0:
    print("High Latency Operations (>5s):")
    display(high_latency_df.head(10))
else:
    print("No high latency operations found")

High Latency Operations (>5s):


Unnamed: 0,TimeGenerated,OperationName,DurationMs,ResultSignature,ResultType,Resource,CorrelationId
0,2025-10-07 13:11:57.794000+00:00,Embeddings_Create,12104,200,,{AI_FOUNDRY_PROJECT},00b4a0ea-0d2e-4859-bfdf-8922275ceefd
1,2025-10-07 13:19:12.088000+00:00,ChatCompletions_Create,11367,200,,{AI_FOUNDRY_PROJECT},bf0c9825-42fd-4abb-93a1-e1b267ae7598
2,2025-10-07 13:08:11.457000+00:00,Uploads a file for use by other operations.,10615,200,,{AI_FOUNDRY_PROJECT},36e13e7a-7e3c-49f7-882e-6b97bfa8d3fa
3,2025-10-07 13:14:28.818000+00:00,Uploads a file for use by other operations.,9997,200,,{AI_FOUNDRY_PROJECT},23febb59-9363-4320-b3a2-7e6f04d812a5
4,2025-10-07 13:20:22.534000+00:00,Uploads a file for use by other operations.,9836,200,,{AI_FOUNDRY_PROJECT},a5d41556-76a5-4242-a908-1ffac40e365c
5,2025-10-07 13:08:11.874000+00:00,Files_Upload,9559,200,,{AI_FOUNDRY_PROJECT},30798f34-8b06-421d-984e-bb3d54a4573d
6,2025-10-07 13:12:24.339000+00:00,ChatCompletions_Create,9014,200,,{AI_FOUNDRY_PROJECT},57729785-cb41-4806-93b3-2a59afce0cec
7,2025-10-07 13:20:22.127000+00:00,Files_Upload,8771,200,,{AI_FOUNDRY_PROJECT},0df2cc3b-8c33-43e9-b0d7-987acbda0c1e
8,2025-10-07 13:01:56.598000+00:00,Uploads a file for use by other operations.,8728,200,,{AI_FOUNDRY_PROJECT},01d4193e-36db-4e76-a7f3-481b657309c2
9,2025-10-07 13:12:15.896000+00:00,Embeddings_Create,8629,200,,{AI_FOUNDRY_PROJECT},51de20eb-aabe-4f76-82e4-8d129a40121c


### 7. Error Trends Over Time

In [67]:
# Updated query for error trends
query_error_trends = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where ResultType == "Failed" or ResultType == "Error" or ResultSignature startswith "4" or ResultSignature startswith "5"
| where TimeGenerated > ago(7d)
| summarize ErrorCount=count() by bin(TimeGenerated, 1h), ErrorType=case(
    ResultSignature == "429" or ResultSignature == "TooManyRequests", "RateLimit",
    ResultSignature == "401" or ResultSignature == "403" or ResultSignature == "Unauthorized", "Authentication",
    ResultSignature startswith "5", "ServerError",
    ResultSignature startswith "4", "ClientError",
    "Other"
)
| order by TimeGenerated desc
"""

error_trends_df = execute_query(query_error_trends, timespan=timedelta(days=7))
if error_trends_df is not None and len(error_trends_df) > 0:
    print("Error Trends (Last 7 Days):")
    display(error_trends_df.head(20))
    
    # Visualize if matplotlib is available and there's data
    try:
        import matplotlib.pyplot as plt
        pivot_df = error_trends_df.pivot(index='TimeGenerated', columns='ErrorType', values='ErrorCount')
        pivot_df.plot(kind='line', figsize=(12, 6), title='Error Trends Over Time')
        plt.ylabel('Error Count')
        plt.xlabel('Time')
        plt.legend(title='Error Type')
        plt.grid(True)
        plt.show()
    except ImportError:
        print("Install matplotlib for visualization: pip install matplotlib")
    except TypeError as e:
        print(f"  No data available for plotting: {e}")
else:
    print(" No errors found in the last 7 days - service is healthy!")

 No errors found in the last 7 days - service is healthy!


### 8. Request Usage Analysis (Token data not in AzureDiagnostics)

---

##  Agent-Specific Monitoring (Category: Agents)

Based on Azure AI Foundry Agent Service metrics, monitor agent operations including:
- **Agents**: Events for AI Agents (create, update, delete)
- **Messages**: Events for agent messages by thread
- **Runs**: Agent run status, duration, and outcomes
- **Threads**: Thread lifecycle events
- **ToolCalls**: Tool invocations by agents
- **Tokens**: Token usage by agent and type

### 1. Agent Operations Summary

In [68]:
# Query all agent operations (agents, threads, messages, runs, tools)
query_agent_summary = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated > ago(24h)
| where OperationName has_any ("Create_Assistant", "Update_Assistant", "Delete_Assistant",
                                "Create_Thread", "Delete_Thread",
                                "Create_Message", "List_Messages",
                                "Create_Run", "Cancel_Run", "Submit_Tool_Outputs",
                                "List_Run_Steps")
| extend Props = parse_json(properties_s)
| summarize 
    Count=count(),
    SuccessCount=countif(ResultType != "Failed"),
    FailureCount=countif(ResultType == "Failed"),
    AvgDurationMs=avg(DurationMs),
    MaxDurationMs=max(DurationMs),
    UniqueCorrelationIds=dcount(CorrelationId)
    by OperationName
| extend SuccessRate = round(SuccessCount * 100.0 / Count, 2)
| order by Count desc
"""

print(" Agent Operations Summary (Last 24h):")
agent_summary_df = execute_query(query_agent_summary)
if agent_summary_df is not None and len(agent_summary_df) > 0:
    display(agent_summary_df)
    
    print(f"\n Total Agent Operations: {agent_summary_df['Count'].sum():,}")
    print(f" Total Successes: {agent_summary_df['SuccessCount'].sum():,}")
    print(f" Total Failures: {agent_summary_df['FailureCount'].sum():,}")
    
    avg_success_rate = (agent_summary_df['SuccessCount'].sum() / agent_summary_df['Count'].sum() * 100)
    print(f" Overall Success Rate: {avg_success_rate:.2f}%")
else:
    print("No agent operations found")

 Agent Operations Summary (Last 24h):


Unnamed: 0,OperationName,Count,SuccessCount,FailureCount,AvgDurationMs,MaxDurationMs,UniqueCorrelationIds,SuccessRate
0,Create_Message,36,36,0,242.694444,1603,36,100
1,Create_Thread,19,19,0,138.157895,219,19,100
2,Create_Run,19,19,0,414.052632,866,19,100
3,Create_Assistant,10,10,0,274.5,549,10,100
4,Delete_Assistant,10,10,0,152.2,412,10,100



 Total Agent Operations: 94
 Total Successes: 94
 Total Failures: 0
 Overall Success Rate: 100.00%


### 2. Agent Run Status Analysis

In [71]:
# Monitor agent run status (queued, in_progress, completed, failed, cancelled, expired)
query_agent_runs = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated > ago(24h)
| where OperationName == "Create_Run" or OperationName == "Cancel_Run" or OperationName == "Submit_Tool_Outputs"
| extend Props = parse_json(properties_s)
| extend 
    RunStatus = tostring(Props.status),
    AgentId = tostring(Props.agentId),
    ThreadId = tostring(Props.threadId),
    StreamType = tostring(Props.streamType)
| summarize 
    RunCount=count(),
    AvgDurationMs=avg(DurationMs),
    MaxDurationMs=max(DurationMs),
    Failures=countif(ResultType == "Failed")
    by OperationName, StreamType, bin(TimeGenerated, 1h)
| order by TimeGenerated desc
"""

print(" Agent Run Status Analysis (Last 24h):")
agent_runs_df = execute_query(query_agent_runs)
if agent_runs_df is not None and len(agent_runs_df) > 0:
    display(agent_runs_df.head(20))
    
    print(f"\n Summary:")
    print(f"   Total Runs: {agent_runs_df['RunCount'].sum():,}")
    print(f"   Failed Runs: {agent_runs_df['Failures'].sum():,}")
    avg_duration = agent_runs_df['AvgDurationMs'].mean()
    print(f"   Avg Run Duration: {avg_duration:.2f} ms")
else:
    print("No agent run data found")

 Agent Run Status Analysis (Last 24h):


Unnamed: 0,OperationName,StreamType,TimeGenerated,RunCount,AvgDurationMs,MaxDurationMs,Failures
0,Create_Run,,2025-10-07 13:00:00+00:00,19,414.052632,866,0



 Summary:
   Total Runs: 19
   Failed Runs: 0
   Avg Run Duration: 414.05 ms


### 3. Thread and Message Activity

In [72]:
# Monitor thread creation and message activity
query_threads_messages = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated > ago(24h)
| where OperationName has_any ("Create_Thread", "Delete_Thread", "Create_Message", "List_Messages")
| summarize 
    EventCount=count(),
    UniqueThreads=dcount(CorrelationId),
    AvgDurationMs=avg(DurationMs)
    by OperationName, bin(TimeGenerated, 1h)
| order by TimeGenerated desc, EventCount desc
"""

print(" Thread and Message Activity (Last 24h):")
threads_messages_df = execute_query(query_threads_messages)
if threads_messages_df is not None and len(threads_messages_df) > 0:
    display(threads_messages_df.head(20))
    
    # Calculate totals by operation
    totals = threads_messages_df.groupby('OperationName')['EventCount'].sum().sort_values(ascending=False)
    print(f"\n Totals by Operation:")
    for op, count in totals.items():
        print(f"   {op}: {count:,}")
else:
    print("No thread/message activity found")

 Thread and Message Activity (Last 24h):


Unnamed: 0,OperationName,TimeGenerated,EventCount,UniqueThreads,AvgDurationMs
0,Create_Message,2025-10-07 13:00:00+00:00,36,36,242.694444
1,Create_Thread,2025-10-07 13:00:00+00:00,19,19,138.157895



 Totals by Operation:
   Create_Message: 36
   Create_Thread: 19


### 4. Tool Calls Monitoring

In [73]:
# Monitor tool calls made by agents (Submit_Tool_Outputs, List_Run_Steps)
query_tool_calls = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated > ago(24h)
| where OperationName has_any ("Submit_Tool_Outputs", "List_Run_Steps")
| extend Props = parse_json(properties_s)
| extend ToolName = tostring(Props.toolName)
| summarize 
    ToolCallCount=count(),
    SuccessCount=countif(ResultType != "Failed"),
    FailureCount=countif(ResultType == "Failed"),
    AvgDurationMs=avg(DurationMs),
    MaxDurationMs=max(DurationMs)
    by OperationName
| extend SuccessRate = round(SuccessCount * 100.0 / ToolCallCount, 2)
| order by ToolCallCount desc
"""

print(" Tool Calls Monitoring (Last 24h):")
tool_calls_df = execute_query(query_tool_calls)
if tool_calls_df is not None and len(tool_calls_df) > 0:
    display(tool_calls_df)
    
    print(f"\n Summary:")
    print(f"   Total Tool Calls: {tool_calls_df['ToolCallCount'].sum():,}")
    print(f"   Successful Calls: {tool_calls_df['SuccessCount'].sum():,}")
    print(f"   Failed Calls: {tool_calls_df['FailureCount'].sum():,}")
    
    if tool_calls_df['ToolCallCount'].sum() > 0:
        overall_success = (tool_calls_df['SuccessCount'].sum() / tool_calls_df['ToolCallCount'].sum() * 100)
        print(f"   Success Rate: {overall_success:.2f}%")
else:
    print("No tool call data found")

 Tool Calls Monitoring (Last 24h):
No tool call data found
No tool call data found


### 5. Agent Failures and Error Analysis

In [74]:
# Detailed analysis of agent-specific failures
query_agent_failures = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated > ago(24h)
| where OperationName has_any ("Create_Assistant", "Update_Assistant", "Delete_Assistant",
                                "Create_Thread", "Delete_Thread",
                                "Create_Message", "Create_Run", "Cancel_Run", "Submit_Tool_Outputs")
| where ResultType == "Failed" or ResultSignature startswith "4" or ResultSignature startswith "5"
| extend Props = parse_json(properties_s)
| project 
    TimeGenerated,
    OperationName,
    ResultSignature,
    ResultDescription,
    DurationMs,
    CorrelationId,
    CallerIPAddress
| order by TimeGenerated desc
"""

print(" Agent Failures (Last 24h):")
agent_failures_df = execute_query(query_agent_failures)
if agent_failures_df is not None and len(agent_failures_df) > 0:
    display(agent_failures_df.head(20))
    
    print(f"\n  Found {len(agent_failures_df)} agent failures")
    
    # Group by operation
    failure_by_op = agent_failures_df.groupby('OperationName').size().sort_values(ascending=False)
    print(f"\n Failures by Operation:")
    for op, count in failure_by_op.items():
        print(f"   {op}: {count}")
else:
    print(" No agent failures found - all agent operations successful!")

 Agent Failures (Last 24h):
 No agent failures found - all agent operations successful!
 No agent failures found - all agent operations successful!


---

###  Agent Monitoring Summary

The following agent-specific monitoring capabilities were explored:

#### **Metrics Covered (Category: Agents)**

1. **Agents**: Create, Update, Delete assistant operations
2. **Messages**: Message creation and listing by thread
3. **Runs**: Agent run execution, status, and cancellation
4. **Threads**: Thread lifecycle (create, delete)
5. **ToolCalls**: Submit tool outputs and run steps
6. **Tokens**: Token usage tracking (via Azure Monitor Metrics, not logs)

#### **Key Queries Implemented**

**Agent Operations Summary** - Comprehensive overview of all agent operations
- Success/failure counts
- Success rates per operation
- Average and max duration
- Unique correlation IDs

 **Agent Run Status Analysis** - Monitor agent run execution
- Run counts by operation and stream type
- Hourly trends
- Failure tracking

 **Thread & Message Activity** - Track conversation patterns
- Thread creation/deletion
- Message activity by thread
- Unique thread counts

 **Tool Calls Monitoring** - Tool usage tracking
- Tool call success/failure rates
- Duration metrics
- Tool-specific analysis

 **Agent Failures Analysis** - Detailed error tracking
- Failures by operation type
- Error descriptions and correlation IDs
- IP-based analysis for security



In [75]:

# analyze request patterns
query_request_usage = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated > ago(24h)
| where OperationName has "ChatCompletions" or OperationName has "Completions" or OperationName has "Embeddings"
| summarize 
    RequestCount=count(),
    AvgDurationMs=avg(DurationMs),
    FailureCount=countif(ResultType == "Failed")
    by Resource, OperationName, bin(TimeGenerated, 1h)
| extend SuccessRate = round((RequestCount - FailureCount) * 100.0 / RequestCount, 2)
| order by TimeGenerated desc
"""

request_usage_df = execute_query(query_request_usage)
if request_usage_df is not None and len(request_usage_df) > 0:
    print("📊 Request Usage Analysis (Last 24h):")
    display(request_usage_df.head(20))
    
    print(f"\n📈 Summary:")
    print(f"   Total Requests: {request_usage_df['RequestCount'].sum():,}")
    print(f"   Total Failures: {request_usage_df['FailureCount'].sum():,}")
    avg_success = request_usage_df['SuccessRate'].mean()
    print(f"   Avg Success Rate: {avg_success:.2f}%")
else:
    print("No request usage data available")
    
print("\n💡 For actual token usage, check:")
print("   - Azure Monitor Metrics → OpenAI metrics")
print("   - Azure Cost Management → Usage details")
print("   - Azure OpenAI Studio → Usage dashboard")

📊 Request Usage Analysis (Last 24h):


Unnamed: 0,Resource,OperationName,TimeGenerated,RequestCount,AvgDurationMs,FailureCount,SuccessRate
0,{AI_FOUNDRY_PROJECT},ChatCompletions_Create,2025-10-07 13:00:00+00:00,39,4164.128205,0,100
1,{AI_FOUNDRY_PROJECT},Embeddings_Create,2025-10-07 13:00:00+00:00,144,306.159722,0,100
2,{AI_FOUNDRY_PROJECT},Completions_Create,2025-10-07 13:00:00+00:00,30,851.1,0,100



📈 Summary:
   Total Requests: 213
   Total Failures: 0
   Avg Success Rate: 100.00%

💡 For actual token usage, check:
   - Azure Monitor Metrics → OpenAI metrics
   - Azure Cost Management → Usage details
   - Azure OpenAI Studio → Usage dashboard


## Create Azure Monitor Alert Rules

In [76]:
# Alert Rule Definitions
alert_rules = [
    {
        "name": "AI-Agent-High-Error-Rate",
        "description": "Alert when error rate exceeds 10% in 15 minutes",
        "query": """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| summarize 
    TotalRequests=count(),
    ErrorRequests=countif(ResultType == "Failed" or ResultType == "Error")
| extend ErrorRate = (ErrorRequests * 100.0) / TotalRequests
| where ErrorRate > 10
        """,
        "frequency": "5m",
        "time_window": "15m",
        "severity": 2,
        "threshold": 0
    },
    {
        "name": "AI-Agent-Rate-Limit-Exceeded",
        "description": "Alert when rate limit errors exceed 5 in 5 minutes",
        "query": """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where ResultSignature == "429" or ResultSignature == "TooManyRequests"
| summarize Count=count()
        """,
        "frequency": "5m",
        "time_window": "5m",
        "severity": 1,
        "threshold": 5
    },
    {
        "name": "AI-Agent-Server-Errors",
        "description": "Alert on any 5xx server errors",
        "query": """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where ResultSignature startswith "5" or ResultType == "Failed"
| summarize Count=count()
        """,
        "frequency": "5m",
        "time_window": "5m",
        "severity": 0,
        "threshold": 0
    },
    {
        "name": "AI-Agent-High-Latency",
        "description": "Alert when average latency exceeds 10 seconds",
        "query": """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| summarize AvgLatency=avg(DurationMs)
        """,
        "frequency": "5m",
        "time_window": "15m",
        "severity": 2,
        "threshold": 10000
    }
]

print("Alert Rules Configuration:")
print(json.dumps(alert_rules, indent=2))

Alert Rules Configuration:
[
  {
    "name": "AI-Agent-High-Error-Rate",
    "description": "Alert when error rate exceeds 10% in 15 minutes",
    "query": "\nAzureDiagnostics\n| where ResourceProvider == \"MICROSOFT.COGNITIVESERVICES\"\n| summarize \n    TotalRequests=count(),\n    ErrorRequests=countif(ResultType == \"Failed\" or ResultType == \"Error\")\n| extend ErrorRate = (ErrorRequests * 100.0) / TotalRequests\n| where ErrorRate > 10\n        ",
    "frequency": "5m",
    "time_window": "15m",
    "severity": 2,
    "threshold": 0
  },
  {
    "name": "AI-Agent-Rate-Limit-Exceeded",
    "description": "Alert when rate limit errors exceed 5 in 5 minutes",
    "query": "\nAzureDiagnostics\n| where ResourceProvider == \"MICROSOFT.COGNITIVESERVICES\"\n| where ResultSignature == \"429\" or ResultSignature == \"TooManyRequests\"\n| summarize Count=count()\n        ",
    "frequency": "5m",
    "time_window": "5m",
    "severity": 1,
    "threshold": 5
  },
  {
    "name": "AI-Agent-Se

## Azure CLI Commands for Alert Creation

Run these commands in your terminal to create the alert rules:

In [77]:
# Generate Azure CLI commands
cli_commands = f"""
# ============================================================================
# Azure Monitor Alert Setup for AI Agent Service
# ============================================================================

# 1. Create an action group for notifications
az monitor action-group create \\
  --name "AI-Agent-Alerts" \\
  --resource-group {RESOURCE_GROUP} \\
  --short-name "AIAgent" \\
  --email-receiver "admin" "your-email@example.com"

# 2. Get the action group ID
ACTION_GROUP_ID=$(az monitor action-group show \\
  --name "AI-Agent-Alerts" \\
  --resource-group {RESOURCE_GROUP} \\
  --query id -o tsv)

# 3. Create Alert Rule: High Error Rate
az monitor scheduled-query create \\
  --name "AI-Agent-High-Error-Rate" \\
  --resource-group {RESOURCE_GROUP} \\
  --scopes "/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.OperationalInsights/workspaces/$WORKSPACE_ID" \\
  --condition "count 'Placeholder' > 0" \\
  --condition-query "AzureDiagnostics | where ResourceProvider == 'MICROSOFT.COGNITIVESERVICES' | summarize TotalRequests=count(), ErrorRequests=countif(httpStatusCode_d >= 400) | extend ErrorRate = (ErrorRequests * 100.0) / TotalRequests | where ErrorRate > 10" \\
  --description "Alert when error rate exceeds 10% in 15 minutes" \\
  --evaluation-frequency 5m \\
  --window-size 15m \\
  --severity 2 \\
  --action-groups $ACTION_GROUP_ID

# 4. Create Alert Rule: Rate Limit
az monitor scheduled-query create \\
  --name "AI-Agent-Rate-Limit-Exceeded" \\
  --resource-group {RESOURCE_GROUP} \\
  --scopes "/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.OperationalInsights/workspaces/$WORKSPACE_ID" \\
  --condition "count 'Placeholder' > 5" \\
  --condition-query "AzureDiagnostics | where ResourceProvider == 'MICROSOFT.COGNITIVESERVICES' | where httpStatusCode_d == 429" \\
  --description "Alert when rate limit errors exceed 5 in 5 minutes" \\
  --evaluation-frequency 5m \\
  --window-size 5m \\
  --severity 1 \\
  --action-groups $ACTION_GROUP_ID

# 5. Create Alert Rule: Server Errors
az monitor scheduled-query create \\
  --name "AI-Agent-Server-Errors" \\
  --resource-group {RESOURCE_GROUP} \\
  --scopes "/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.OperationalInsights/workspaces/$WORKSPACE_ID" \\
  --condition "count 'Placeholder' > 0" \\
  --condition-query "AzureDiagnostics | where ResourceProvider == 'MICROSOFT.COGNITIVESERVICES' | where httpStatusCode_d >= 500 and httpStatusCode_d < 600" \\
  --description "Alert on any 5xx server errors" \\
  --evaluation-frequency 5m \\
  --window-size 5m \\
  --severity 0 \\
  --action-groups $ACTION_GROUP_ID

# 6. Create Alert Rule: High Latency
az monitor scheduled-query create \\
  --name "AI-Agent-High-Latency" \\
  --resource-group {RESOURCE_GROUP} \\
  --scopes "/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.OperationalInsights/workspaces/$WORKSPACE_ID" \\
  --condition "count 'Placeholder' > 0" \\
  --condition-query "AzureDiagnostics | where ResourceProvider == 'MICROSOFT.COGNITIVESERVICES' | summarize AvgLatency=avg(DurationMs) | where AvgLatency > 10000" \\
  --description "Alert when average latency exceeds 10 seconds" \\
  --evaluation-frequency 5m \\
  --window-size 15m \\
  --severity 2 \\
  --action-groups $ACTION_GROUP_ID

echo "✅ Alert rules created successfully!"
"""

print(cli_commands)
print("\n" + "="*80)
print("⚠️  IMPORTANT: Update the following before running:")
print("   1. Replace 'your-email@example.com' with your actual email")
print("   2. Set WORKSPACE_ID environment variable")
print("   3. Verify SUBSCRIPTION_ID and RESOURCE_GROUP values")
print("="*80)


# Azure Monitor Alert Setup for AI Agent Service

# 1. Create an action group for notifications
az monitor action-group create \
  --name "AI-Agent-Alerts" \
  --resource-group {RESOURCE_GROUP} \
  --short-name "AIAgent" \
  --email-receiver "admin" "your-email@example.com"

# 2. Get the action group ID
ACTION_GROUP_ID=$(az monitor action-group show \
  --name "AI-Agent-Alerts" \
  --resource-group {RESOURCE_GROUP} \
  --query id -o tsv)

# 3. Create Alert Rule: High Error Rate
az monitor scheduled-query create \
  --name "AI-Agent-High-Error-Rate" \
  --resource-group {RESOURCE_GROUP} \
  --scopes "/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.OperationalInsights/workspaces/$WORKSPACE_ID" \
  --condition "count 'Placeholder' > 0" \
  --condition-query "AzureDiagnostics | where ResourceProvider == 'MICROSOFT.COGNITIVESERVICES' | summarize TotalRequests=count(), ErrorRequests=countif(httpStatusCode_d >= 400) | extend ErrorRate = (ErrorRequests * 1

In [None]:
import json

# Enhanced Workbook Template with Filters
workbook_template = {
    "version": "Notebook/1.0",
    "items": [
        # Title Section
        {
            "type": 1,
            "content": {
                "json": "# Azure AI Agent Service Monitoring\n\n## Comprehensive monitoring and insights for AI Foundry Agent Service\n\nThis workbook provides deep insights into your AI Agent Service usage, helping you monitor performance, track operations, and optimize your AI infrastructure.\n\n📊 **Features:**\n- Real-time agent operations monitoring\n- Performance metrics and latency analysis\n- Tool calls tracking and success rates\n- Detailed failure analysis\n- Resource usage insights\n\n---\n\n⚠️ **Important:** This workbook queries your Log Analytics workspace.\n\n💡 **Tip:** Use the filters below to focus on specific resource groups, projects, or time periods."
            },
            "name": "text - title"
        },
        # Parameters/Filters Section
        {
            "type": 9,
            "content": {
                "version": "KqlParameterItem/1.0",
                "parameters": [
                    {
                        "id": "time-range-param",
                        "version": "KqlParameterItem/1.0",
                        "name": "TimeRange",
                        "label": "Time Range",
                        "type": 4,
                        "isRequired": True,
                        "value": {"durationMs": 86400000},
                        "typeSettings": {
                            "selectableValues": [
                                {"durationMs": ms} for ms in [300000, 900000, 1800000, 3600000, 14400000, 43200000, 86400000, 172800000, 259200000, 604800000, 1209600000, 2419200000, 2592000000]
                            ],
                            "allowCustom": True
                        }
                    },
                    {
                        "id": "resource-group-param",
                        "version": "KqlParameterItem/1.0",
                        "name": "ResourceGroup",
                        "label": "Resource Group",
                        "type": 2,
                        "isRequired": True,
                        "multiSelect": True,
                        "quote": "'",
                        "delimiter": ",",
                        "query": r"AzureDiagnostics\n| where ResourceProvider == \"MICROSOFT.COGNITIVESERVICES\"\n| where TimeGenerated {TimeRange:query}\n| extend ResourceGroupName = split(_ResourceId, '/')[4]\n| where isnotempty(ResourceGroupName)\n| distinct ResourceGroupName\n| order by ResourceGroupName asc",
                        "typeSettings": {
                            "additionalResourceOptions": ["value::all"],
                            "showDefault": False
                        },
                        "timeContext": {"durationMs": 0},
                        "timeContextFromParameter": "TimeRange",
                        "defaultValue": "value::all",
                        "queryType": 0,
                        "resourceType": "microsoft.operationalinsights/workspaces"
                    },
                    {
                        "id": "ai-project-param",
                        "version": "KqlParameterItem/1.0",
                        "name": "AIProject",
                        "label": "AI Foundry Project",
                        "type": 2,
                        "isRequired": True,
                        "multiSelect": True,
                        "quote": "'",
                        "delimiter": ",",
                        "query": r"AzureDiagnostics\n| where ResourceProvider == \"MICROSOFT.COGNITIVESERVICES\"\n| where TimeGenerated {TimeRange:query}\n| extend ResourceGroupName = split(_ResourceId, '/')[4]\n| where ResourceGroupName in ({ResourceGroup}) or '*' in ({ResourceGroup})\n| distinct Resource\n| order by Resource asc",
                        "typeSettings": {
                            "additionalResourceOptions": ["value::all"],
                            "showDefault": False
                        },
                        "timeContext": {"durationMs": 0},
                        "timeContextFromParameter": "TimeRange",
                        "defaultValue": "value::all",
                        "queryType": 0,
                        "resourceType": "microsoft.operationalinsights/workspaces"
                    },
                    {
                        "id": "operation-name-param",
                        "version": "KqlParameterItem/1.0",
                        "name": "OperationName",
                        "label": "Operation Name",
                        "type": 2,
                        "isRequired": True,
                        "multiSelect": True,
                        "quote": "'",
                        "delimiter": ",",
                        "query": r"AzureDiagnostics\n| where ResourceProvider == \"MICROSOFT.COGNITIVESERVICES\"\n| where TimeGenerated {TimeRange:query}\n| where Resource in ({AIProject}) or '*' in ({AIProject})\n| where OperationName has_any (\"Create_Assistant\", \"Update_Assistant\", \"Delete_Assistant\", \"Create_Thread\", \"Delete_Thread\", \"Create_Message\", \"List_Messages\", \"Create_Run\", \"Cancel_Run\", \"Submit_Tool_Outputs\", \"List_Run_Steps\")\n| distinct OperationName\n| order by OperationName asc",
                        "typeSettings": {
                            "additionalResourceOptions": ["value::all"],
                            "showDefault": False
                        },
                        "timeContext": {"durationMs": 0},
                        "timeContextFromParameter": "TimeRange",
                        "defaultValue": "value::all",
                        "queryType": 0,
                        "resourceType": "microsoft.operationalinsights/workspaces"
                    },
                    {
                        "id": "result-type-param",
                        "version": "KqlParameterItem/1.0",
                        "name": "ResultType",
                        "label": "Result Type",
                        "type": 2,
                        "isRequired": True,
                        "multiSelect": True,
                        "quote": "'",
                        "delimiter": ",",
                        "query": r"AzureDiagnostics\n| where ResourceProvider == \"MICROSOFT.COGNITIVESERVICES\"\n| where TimeGenerated {TimeRange:query}\n| where Resource in ({AIProject}) or '*' in ({AIProject})\n| where OperationName in ({OperationName}) or '*' in ({OperationName})\n| distinct ResultType\n| order by ResultType asc",
                        "typeSettings": {
                            "additionalResourceOptions": ["value::all"],
                            "showDefault": False
                        },
                        "timeContext": {"durationMs": 0},
                        "timeContextFromParameter": "TimeRange",
                        "defaultValue": "value::all",
                        "queryType": 0,
                        "resourceType": "microsoft.operationalinsights/workspaces"
                    },
                    {
                        "id": "location-param",
                        "version": "KqlParameterItem/1.0",
                        "name": "Location",
                        "label": "Location",
                        "type": 2,
                        "isRequired": True,
                        "multiSelect": True,
                        "quote": "'",
                        "delimiter": ",",
                        "query": r"AzureDiagnostics\n| where ResourceProvider == \"MICROSOFT.COGNITIVESERVICES\"\n| where TimeGenerated {TimeRange:query}\n| where Resource in ({AIProject}) or '*' in ({AIProject})\n| extend Location = location_s\n| where isnotempty(Location)\n| distinct Location\n| order by Location asc",
                        "typeSettings": {
                            "additionalResourceOptions": ["value::all"],
                            "showDefault": False
                        },
                        "timeContext": {"durationMs": 0},
                        "timeContextFromParameter": "TimeRange",
                        "defaultValue": "value::all",
                        "queryType": 0,
                        "resourceType": "microsoft.operationalinsights/workspaces"
                    }
                ],
                "style": "pills",
                "queryType": 0,
                "resourceType": "microsoft.operationalinsights/workspaces"
            },
            "name": "parameters - filters"
        },
        # Overview Header
        {
            "type": 1,
            "content": {
                "json": r"---\n\n## 📊 Overview Metrics"
            },
            "name": "text - overview-header"
        },
        # Service Health Overview Tiles
        {
            "type": 3,
            "content": {
                "version": "KqlItem/1.0",
                "query": r"""let data = AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated {TimeRange:query}
| extend ResourceGroupName = split(_ResourceId, '/')[4]
| where ResourceGroupName in ({ResourceGroup}) or '*' in ({ResourceGroup})
| where Resource in ({AIProject}) or '*' in ({AIProject})
| where OperationName in ({OperationName}) or '*' in ({OperationName})
| where ResultType in ({ResultType}) or '*' in ({ResultType})
| where location_s in ({Location}) or '*' in ({Location});
data
| summarize 
    TotalRequests=count(),
    SuccessCount=countif(ResultType != "Failed"),
    FailureCount=countif(ResultType == "Failed"),
    AvgDurationMs=avg(DurationMs)
| extend 
    SuccessRate = round(SuccessCount * 100.0 / TotalRequests, 2),
    AvgDurationSec = round(AvgDurationMs / 1000.0, 2)
| project 
    ['Total Requests']=TotalRequests,
    ['Success Count']=SuccessCount,
    ['Failure Count']=FailureCount,
    ['Success Rate %']=SuccessRate,
    ['Avg Duration (sec)']=AvgDurationSec""",
                "size": 3,
                "title": "Service Health Overview",
                "timeContextFromParameter": "TimeRange",
                "queryType": 0,
                "resourceType": "microsoft.operationalinsights/workspaces",
                "visualization": "tiles",
                "tileSettings": {
                    "titleContent": {"columnMatch": "Total Requests", "formatter": 1},
                    "leftContent": {"columnMatch": "Total Requests", "formatter": 12, "formatOptions": {"palette": "auto"}},
                    "showBorder": True
                }
            },
            "customWidth": "100",
            "name": "query - overview-metrics"
        },
        # Agent Operations Timeline
        {
            "type": 3,
            "content": {
                "version": "KqlItem/1.0",
                "query": r"""AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated {TimeRange:query}
| extend ResourceGroupName = split(_ResourceId, '/')[4]
| where ResourceGroupName in ({ResourceGroup}) or '*' in ({ResourceGroup})
| where Resource in ({AIProject}) or '*' in ({AIProject})
| where OperationName in ({OperationName}) or '*' in ({OperationName})
| where ResultType in ({ResultType}) or '*' in ({ResultType})
| where location_s in ({Location}) or '*' in ({Location})
| where OperationName has_any ("Create_Assistant", "Create_Thread", "Create_Message", "Create_Run")
| summarize Count=count() by OperationName, bin(TimeGenerated, 1h)
| order by TimeGenerated desc""",
                "size": 0,
                "title": "Agent Operations Timeline",
                "timeContextFromParameter": "TimeRange",
                "queryType": 0,
                "resourceType": "microsoft.operationalinsights/workspaces",
                "visualization": "barchart"
            },
            "customWidth": "50",
            "name": "query - operations-timeline"
        },
        # Success Rate Over Time
        {
            "type": 3,
            "content": {
                "version": "KqlItem/1.0",
                "query": r"""AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated {TimeRange:query}
| extend ResourceGroupName = split(_ResourceId, '/')[4]
| where ResourceGroupName in ({ResourceGroup}) or '*' in ({ResourceGroup})
| where Resource in ({AIProject}) or '*' in ({AIProject})
| where OperationName in ({OperationName}) or '*' in ({OperationName})
| where ResultType in ({ResultType}) or '*' in ({ResultType})
| where location_s in ({Location}) or '*' in ({Location})
| where OperationName has "Agent" or OperationName has "Thread" or OperationName has "Run"
| summarize Total=count(), Success=countif(ResultType != "Failed") by bin(TimeGenerated, 1h)
| extend SuccessRate = round(Success * 100.0 / Total, 2)""",
                "size": 0,
                "title": "Agent Success Rate Over Time",
                "timeContextFromParameter": "TimeRange",
                "queryType": 0,
                "resourceType": "microsoft.operationalinsights/workspaces",
                "visualization": "linechart"
            },
            "customWidth": "50",
            "name": "query - success-rate"
        },
        # Detailed Analysis Header
        {
            "type": 1,
            "content": {
                "json": r"---\n\n## 🔍 Detailed Analysis"
            },
            "name": "text - details-header"
        },
        # Operations Summary Table
        {
            "type": 3,
            "content": {
                "version": "KqlItem/1.0",
                "query": r"""AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated {TimeRange:query}
| extend ResourceGroupName = split(_ResourceId, '/')[4]
| where ResourceGroupName in ({ResourceGroup}) or '*' in ({ResourceGroup})
| where Resource in ({AIProject}) or '*' in ({AIProject})
| where OperationName in ({OperationName}) or '*' in ({OperationName})
| where ResultType in ({ResultType}) or '*' in ({ResultType})
| where location_s in ({Location}) or '*' in ({Location})
| where OperationName has_any ("Create_Assistant", "Update_Assistant", "Delete_Assistant", "Create_Thread", "Delete_Thread", "Create_Message", "List_Messages", "Create_Run", "Cancel_Run", "Submit_Tool_Outputs", "List_Run_Steps")
| extend Props = parse_json(properties_s)
| summarize 
    Count=count(),
    SuccessCount=countif(ResultType != "Failed"),
    FailureCount=countif(ResultType == "Failed"),
    AvgDurationMs=round(avg(DurationMs), 2),
    MaxDurationMs=max(DurationMs),
    P95DurationMs=round(percentile(DurationMs, 95), 2),
    UniqueCorrelationIds=dcount(CorrelationId)
    by OperationName, Resource
| extend SuccessRate = round(SuccessCount * 100.0 / Count, 2)
| project 
    ['Operation']=OperationName,
    ['Resource']=Resource,
    ['Total Requests']=Count,
    ['Success']=SuccessCount,
    ['Failures']=FailureCount,
    ['Success Rate %']=SuccessRate,
    ['Avg Duration (ms)']=AvgDurationMs,
    ['P95 Duration (ms)']=P95DurationMs,
    ['Max Duration (ms)']=MaxDurationMs,
    ['Unique Sessions']=UniqueCorrelationIds
| order by ['Total Requests'] desc""",
                "size": 0,
                "title": "Agent Operations Summary by Resource",
                "timeContextFromParameter": "TimeRange",
                "showExportToExcel": True,
                "queryType": 0,
                "resourceType": "microsoft.operationalinsights/workspaces",
                "visualization": "table",
                "gridSettings": {
                    "formatters": [
                        {"columnMatch": "Success Rate %", "formatter": 8, "formatOptions": {"palette": "greenRed"}},
                        {"columnMatch": "Avg Duration (ms)", "formatter": 8, "formatOptions": {"palette": "blue"}}
                    ],
                    "filter": True
                }
            },
            "customWidth": "100",
            "name": "query - operations-summary"
        },
        # Thread Activity
        {
            "type": 3,
            "content": {
                "version": "KqlItem/1.0",
                "query": r"""AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated {TimeRange:query}
| extend ResourceGroupName = split(_ResourceId, '/')[4]
| where ResourceGroupName in ({ResourceGroup}) or '*' in ({ResourceGroup})
| where Resource in ({AIProject}) or '*' in ({AIProject})
| where OperationName in ({OperationName}) or '*' in ({OperationName})
| where ResultType in ({ResultType}) or '*' in ({ResultType})
| where location_s in ({Location}) or '*' in ({Location})
| where OperationName has_any ("Create_Thread", "Delete_Thread", "Create_Message", "List_Messages")
| summarize 
    EventCount=count(),
    UniqueThreads=dcount(CorrelationId),
    AvgDurationMs=round(avg(DurationMs), 2)
    by OperationName, bin(TimeGenerated, 1h)
| order by TimeGenerated desc, EventCount desc""",
                "size": 0,
                "title": "Thread & Message Activity Over Time",
                "timeContextFromParameter": "TimeRange",
                "queryType": 0,
                "resourceType": "microsoft.operationalinsights/workspaces",
                "visualization": "barchart"
            },
            "customWidth": "50",
            "name": "query - thread-activity"
        },
        # Latency Percentiles
        {
            "type": 3,
            "content": {
                "version": "KqlItem/1.0",
                "query": r"""AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated {TimeRange:query}
| extend ResourceGroupName = split(_ResourceId, '/')[4]
| where ResourceGroupName in ({ResourceGroup}) or '*' in ({ResourceGroup})
| where Resource in ({AIProject}) or '*' in ({AIProject})
| where OperationName in ({OperationName}) or '*' in ({OperationName})
| where ResultType in ({ResultType}) or '*' in ({ResultType})
| where location_s in ({Location}) or '*' in ({Location})
| summarize 
    AvgDurationMs=round(avg(DurationMs), 2),
    P50DurationMs=round(percentile(DurationMs, 50), 2),
    P95DurationMs=round(percentile(DurationMs, 95), 2),
    P99DurationMs=round(percentile(DurationMs, 99), 2),
    MaxDurationMs=max(DurationMs)
    by bin(TimeGenerated, 1h)
| order by TimeGenerated desc""",
                "size": 0,
                "title": "Latency Percentiles Over Time",
                "timeContextFromParameter": "TimeRange",
                "queryType": 0,
                "resourceType": "microsoft.operationalinsights/workspaces",
                "visualization": "linechart"
            },
            "customWidth": "50",
            "name": "query - latency-percentiles"
        },
        # Tool Calls Header
        {
            "type": 1,
            "content": {
                "json": r"---\n\n## 🛠️ Tool Calls Monitoring"
            },
            "name": "text - tools-header"
        },
        # Tool Activity
        {
            "type": 3,
            "content": {
                "version": "KqlItem/1.0",
                "query": r"""AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated {TimeRange:query}
| extend ResourceGroupName = split(_ResourceId, '/')[4]
| where ResourceGroupName in ({ResourceGroup}) or '*' in ({ResourceGroup})
| where Resource in ({AIProject}) or '*' in ({AIProject})
| where OperationName in ({OperationName}) or '*' in ({OperationName})
| where ResultType in ({ResultType}) or '*' in ({ResultType})
| where location_s in ({Location}) or '*' in ({Location})
| where OperationName has_any ("Submit_Tool_Outputs", "List_Run_Steps")
| extend Props = parse_json(properties_s)
| summarize 
    TotalToolCalls=count(),
    UniqueRuns=dcount(CorrelationId),
    AvgDurationMs=round(avg(DurationMs), 2),
    MaxDurationMs=max(DurationMs),
    SuccessCount=countif(ResultType != "Failed"),
    FailureCount=countif(ResultType == "Failed")
    by OperationName, bin(TimeGenerated, 1h)
| extend SuccessRate = round(SuccessCount * 100.0 / TotalToolCalls, 2)
| order by TimeGenerated desc""",
                "size": 0,
                "title": "Tool Calls Activity Over Time",
                "timeContextFromParameter": "TimeRange",
                "queryType": 0,
                "resourceType": "microsoft.operationalinsights/workspaces",
                "visualization": "barchart"
            },
            "customWidth": "50",
            "name": "query - tool-activity"
        },
        # Tool Summary
        {
            "type": 3,
            "content": {
                "version": "KqlItem/1.0",
                "query": r"""AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated {TimeRange:query}
| extend ResourceGroupName = split(_ResourceId, '/')[4]
| where ResourceGroupName in ({ResourceGroup}) or '*' in ({ResourceGroup})
| where Resource in ({AIProject}) or '*' in ({AIProject})
| where OperationName in ({OperationName}) or '*' in ({OperationName})
| where ResultType in ({ResultType}) or '*' in ({ResultType})
| where location_s in ({Location}) or '*' in ({Location})
| where OperationName has_any ("Submit_Tool_Outputs", "List_Run_Steps")
| summarize 
    TotalCalls=count(),
    SuccessCount=countif(ResultType != "Failed"),
    FailureCount=countif(ResultType == "Failed"),
    AvgDurationMs=round(avg(DurationMs), 2),
    P95DurationMs=round(percentile(DurationMs, 95), 2)
    by OperationName, Resource
| extend SuccessRate = round(SuccessCount * 100.0 / TotalCalls, 2)
| project 
    ['Tool Operation']=OperationName,
    ['Resource']=Resource,
    ['Total Calls']=TotalCalls,
    ['Success']=SuccessCount,
    ['Failures']=FailureCount,
    ['Success Rate %']=SuccessRate,
    ['Avg Duration (ms)']=AvgDurationMs,
    ['P95 Duration (ms)']=P95DurationMs
| order by ['Total Calls'] desc""",
                "size": 0,
                "title": "Tool Calls Summary by Resource",
                "timeContextFromParameter": "TimeRange",
                "showExportToExcel": True,
                "queryType": 0,
                "resourceType": "microsoft.operationalinsights/workspaces",
                "visualization": "table",
                "gridSettings": {
                    "formatters": [
                        {"columnMatch": "Success Rate %", "formatter": 8, "formatOptions": {"palette": "greenRed"}}
                    ],
                    "filter": True
                }
            },
            "customWidth": "50",
            "name": "query - tool-summary"
        },
        # Failures Header
        {
            "type": 1,
            "content": {
                "json": r"---\n\n## ⚠️ Failures & Errors Analysis"
            },
            "name": "text - failures-header"
        },
        # Tool Call Failures Table
        {
            "type": 3,
            "content": {
                "version": "KqlItem/1.0",
                "query": r"""AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated {TimeRange:query}
| extend ResourceGroupName = split(_ResourceId, '/')[4]
| where ResourceGroupName in ({ResourceGroup}) or '*' in ({ResourceGroup})
| where Resource in ({AIProject}) or '*' in ({AIProject})
| where OperationName in ({OperationName}) or '*' in ({OperationName})
| where location_s in ({Location}) or '*' in ({Location})
| where OperationName has_any ("Submit_Tool_Outputs", "List_Run_Steps")
| where ResultType == "Failed" or httpStatusCode_d >= 400
| extend Props = parse_json(properties_s)
| project 
    ['Time']=TimeGenerated,
    ['Operation']=OperationName,
    ['Resource']=Resource,
    ['Result Type']=ResultType,
    ['Error Code']=ResultSignature,
    ['Description']=ResultDescription,
    ['HTTP Status']=httpStatusCode_d,
    ['Duration (ms)']=DurationMs,
    ['Caller IP']=CallerIPAddress,
    ['Session ID']=CorrelationId
| order by ['Time'] desc""",
                "size": 0,
                "title": "Tool Call Failures (Recent)",
                "timeContextFromParameter": "TimeRange",
                "showExportToExcel": True,
                "queryType": 0,
                "resourceType": "microsoft.operationalinsights/workspaces",
                "visualization": "table",
                "gridSettings": {
                    "formatters": [
                        {
                            "columnMatch": "HTTP Status",
                            "formatter": 18,
                            "formatOptions": {
                                "thresholdsOptions": "colors",
                                "thresholdsGrid": [
                                    {"operator": ">=", "thresholdValue": "500", "representation": "redBright", "text": "{0}{1}"},
                                    {"operator": ">=", "thresholdValue": "400", "representation": "orange", "text": "{0}{1}"},
                                    {"operator": "Default", "thresholdValue": None, "representation": "green", "text": "{0}{1}"}
                                ]
                            }
                        }
                    ],
                    "filter": True,
                    "sortBy": [{"itemKey": "Time", "sortOrder": 2}]
                }
            },
            "customWidth": "100",
            "name": "query - tool-failures"
        },
        # Failures Analysis Table
        {
            "type": 3,
            "content": {
                "version": "KqlItem/1.0",
                "query": r"""AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated {TimeRange:query}
| extend ResourceGroupName = split(_ResourceId, '/')[4]
| where ResourceGroupName in ({ResourceGroup}) or '*' in ({ResourceGroup})
| where Resource in ({AIProject}) or '*' in ({AIProject})
| where OperationName in ({OperationName}) or '*' in ({OperationName})
| where location_s in ({Location}) or '*' in ({Location})
| where OperationName has_any ("Create_Assistant", "Create_Thread", "Create_Message", "Create_Run", "Submit_Tool_Outputs", "List_Run_Steps")
| where ResultType == "Failed" or httpStatusCode_d >= 400
| extend Props = parse_json(properties_s)
| summarize 
    FailureCount=count(),
    AffectedResources=dcount(Resource),
    UniqueErrors=dcount(ResultSignature),
    AvgDurationMs=round(avg(DurationMs), 2),
    FirstOccurrence=min(TimeGenerated),
    LastOccurrence=max(TimeGenerated)
    by OperationName, ResultSignature, ResultDescription, httpStatusCode_d
| extend 
    ['HTTP Status']=tostring(toint(httpStatusCode_d)),
    ['Time Span']=strcat(format_datetime(FirstOccurrence, 'MM-dd HH:mm'), ' to ', format_datetime(LastOccurrence, 'MM-dd HH:mm'))
| project 
    ['Operation']=OperationName,
    ['Error Code']=ResultSignature,
    ['Description']=ResultDescription,
    ['HTTP Status'],
    ['Failure Count']=FailureCount,
    ['Affected Resources']=AffectedResources,
    ['Avg Duration (ms)']=AvgDurationMs,
    ['Time Span']
| order by ['Failure Count'] desc""",
                "size": 0,
                "title": "Agent & Tool Failures Analysis (Grouped by Error Type)",
                "timeContextFromParameter": "TimeRange",
                "showExportToExcel": True,
                "queryType": 0,
                "resourceType": "microsoft.operationalinsights/workspaces",
                "visualization": "table",
                "gridSettings": {
                    "formatters": [
                        {"columnMatch": "Failure Count", "formatter": 8, "formatOptions": {"palette": "red"}},
                        {
                            "columnMatch": "HTTP Status",
                            "formatter": 18,
                            "formatOptions": {
                                "thresholdsOptions": "colors",
                                "thresholdsGrid": [
                                    {"operator": ">=", "thresholdValue": "500", "representation": "redBright", "text": "{0}{1}"},
                                    {"operator": ">=", "thresholdValue": "400", "representation": "orange", "text": "{0}{1}"},
                                    {"operator": "Default", "thresholdValue": None, "representation": "green", "text": "{0}{1}"}
                                ]
                            }
                        }
                    ],
                    "filter": True
                }
            },
            "customWidth": "100",
            "name": "query - failures-analysis"
        },
        # Failure Trend
        {
            "type": 3,
            "content": {
                "version": "KqlItem/1.0",
                "query": r"""AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated {TimeRange:query}
| extend ResourceGroupName = split(_ResourceId, '/')[4]
| where ResourceGroupName in ({ResourceGroup}) or '*' in ({ResourceGroup})
| where Resource in ({AIProject}) or '*' in ({AIProject})
| where OperationName in ({OperationName}) or '*' in ({OperationName})
| where location_s in ({Location}) or '*' in ({Location})
| where ResultType == "Failed" or httpStatusCode_d >= 400
| summarize FailureCount=count() by bin(TimeGenerated, 1h), OperationName
| order by TimeGenerated desc""",
                "size": 0,
                "title": "Failure Trend Over Time",
                "timeContextFromParameter": "TimeRange",
                "queryType": 0,
                "resourceType": "microsoft.operationalinsights/workspaces",
                "visualization": "areachart"
            },
            "customWidth": "50",
            "name": "query - failure-trend"
        },
        # Failures by Status Code
        {
            "type": 3,
            "content": {
                "version": "KqlItem/1.0",
                "query": r"""AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated {TimeRange:query}
| extend ResourceGroupName = split(_ResourceId, '/')[4]
| where ResourceGroupName in ({ResourceGroup}) or '*' in ({ResourceGroup})
| where Resource in ({AIProject}) or '*' in ({AIProject})
| where OperationName in ({OperationName}) or '*' in ({OperationName})
| where location_s in ({Location}) or '*' in ({Location})
| where ResultType == "Failed" or httpStatusCode_d >= 400
| summarize FailureCount=count() by tostring(toint(httpStatusCode_d))
| order by FailureCount desc""",
                "size": 3,
                "title": "Failures by HTTP Status Code",
                "timeContextFromParameter": "TimeRange",
                "queryType": 0,
                "resourceType": "microsoft.operationalinsights/workspaces",
                "visualization": "piechart"
            },
            "customWidth": "50",
            "name": "query - failures-by-status"
        }
    ],
    "fallbackResourceIds": [
        f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.OperationalInsights/workspaces/{WORKSPACE_NAME}"
    ],
    "$schema": "https://github.com/Microsoft/Application-Insights-Workbooks/blob/master/schema/workbook.json"
}

# Save workbook template
workbook_file = "ai_agent_monitoring_workbook_enhanced.json"
with open(workbook_file, 'w') as f:
    json.dump(workbook_template, f, indent=2)

print(f"✅ Enhanced workbook template saved to: {workbook_file}")
print(f"\\n📊 Workbook includes:")
print("   - 6 interactive filters (Time Range, Resource Group, AI Project, Operation, Result Type, Location)")
print("   - Cascading filter logic")
print("   - Ready for visualization queries")
print(f"\\n💡 Next: Add visualization sections to complete the workbook")

✅ Enhanced workbook template saved to: ai_agent_monitoring_workbook_enhanced.json
\n📊 Workbook includes:
   - 6 interactive filters (Time Range, Resource Group, AI Project, Operation, Result Type, Location)
   - Cascading filter logic
   - Ready for visualization queries
\n💡 Next: Add visualization sections to complete the workbook


## 🎨 Azure OpenAI Insights-Style Workbook

Creating a professional workbook template with:
- Subscription and Resource selectors
- Tab navigation (Overview, Monitor, Insights)
- Rich visualizations (pie charts, tables, time charts)
- Cascading filters with proper dependencies

In [89]:
import json

# Azure OpenAI Insights-Style Workbook for AI Agent Service
workbook_openai_style = {
    "version": "Notebook/1.0",
    "items": [
        # Title Section
        {
            "type": 1,
            "content": {
                "json": "# Azure AI Agent Service Insights\n\nUse this workbook to analyze and monitor all your Azure AI Agent Service resources, including agent operations, thread activities, tool calls, and performance metrics."
            },
            "name": "text - Title"
        },
        
        # Subscription and Resource Selectors
        {
            "type": 9,
            "content": {
                "version": "KqlParameterItem/1.0",
                "parameters": [
                    # Subscription Selector
                    {
                        "id": "subscription-param",
                        "version": "KqlParameterItem/1.0",
                        "name": "Subscription",
                        "label": "Subscriptions",
                        "type": 6,
                        "description": "All subscriptions with Azure AI Services",
                        "isRequired": True,
                        "multiSelect": True,
                        "quote": "'",
                        "delimiter": ",",
                        "query": f"Resources\n| where type =~ \"microsoft.cognitiveservices/accounts\"\n| summarize Count = count() by subscriptionId\n| order by Count desc\n| extend Rank = row_number()\n| project value = subscriptionId, label = subscriptionId, selected = subscriptionId == '{SUBSCRIPTION_ID}'",
                        "crossComponentResources": ["value::selected"],
                        "typeSettings": {
                            "additionalResourceOptions": ["value::all"],
                            "showDefault": False
                        },
                        "queryType": 1,
                        "resourceType": "microsoft.resourcegraph/resources"
                    },
                    # AI Service Resources Selector  
                    {
                        "id": "resources-param",
                        "version": "KqlParameterItem/1.0",
                        "name": "Resources",
                        "label": "AI Agent Service",
                        "type": 5,
                        "isRequired": True,
                        "multiSelect": True,
                        "quote": "'",
                        "delimiter": ",",
                        "query": "Resources\n| where type =~ \"microsoft.cognitiveservices/accounts\"\n| order by name asc\n| extend Rank = row_number()\n| project value = id, label = id, selected = Rank <= 5",
                        "crossComponentResources": ["{Subscription}"],
                        "typeSettings": {
                            "additionalResourceOptions": ["value::all"],
                            "showDefault": False
                        },
                        "defaultValue": "value::all",
                        "queryType": 1,
                        "resourceType": "microsoft.resourcegraph/resources"
                    },
                    # Time Range
                    {
                        "id": "time-range-param",
                        "version": "KqlParameterItem/1.0",
                        "name": "TimeRange",
                        "label": "Time Range",
                        "type": 4,
                        "isRequired": True,
                        "value": {"durationMs": 86400000},
                        "typeSettings": {
                            "selectableValues": [
                                {"durationMs": ms} for ms in [300000, 900000, 1800000, 3600000, 14400000, 43200000, 86400000, 172800000, 259200000, 604800000, 1209600000, 2419200000, 2592000000]
                            ],
                            "allowCustom": True
                        }
                    }
                ],
                "style": "above",
                "queryType": 0,
                "resourceType": "microsoft.resourcegraph/resources"
            },
            "name": "parameters - 1",
            "styleSettings": {
                "margin": "15px 0 0 0"
            }
        },
        
        # Navigation Tabs
        {
            "type": 11,
            "content": {
                "version": "LinkItem/1.0",
                "style": "tabs",
                "links": [
                    {
                        "id": "overview-tab",
                        "cellValue": "selectedTab",
                        "linkTarget": "parameter",
                        "linkLabel": "Overview",
                        "subTarget": "Overview",
                        "style": "link"
                    },
                    {
                        "id": "monitor-tab",
                        "cellValue": "selectedTab",
                        "linkTarget": "parameter",
                        "linkLabel": "Monitor",
                        "subTarget": "Monitor",
                        "style": "link"
                    },
                    {
                        "id": "insights-tab",
                        "cellValue": "selectedTab",
                        "linkTarget": "parameter",
                        "linkLabel": "Insights",
                        "subTarget": "Insights",
                        "style": "link"
                    }
                ]
            },
            "name": "links - Navigation links",
            "styleSettings": {
                "margin": "10px 0 0 0"
            }
        },
        
        # OVERVIEW TAB
        {
            "type": 12,
            "content": {
                "version": "NotebookGroup/1.0",
                "groupType": "editable",
                "items": [
                    {
                        "type": 1,
                        "content": {
                            "json": "### Azure AI Agent Service Overview\n\nGet comprehensive insights into your AI Agent Service deployment, including resource distribution, agent operations, and performance metrics."
                        },
                        "name": "text - Overview Title"
                    },
                    # Resource Distribution by Subscription
                    {
                        "type": 3,
                        "content": {
                            "version": "KqlItem/1.0",
                            "query": "resources\n| where type =~ \"microsoft.cognitiveservices/accounts\"\n| where name in~ (split(\"{Resources:label}\", \", \"))\n| summarize count(type) by subscriptionId",
                            "size": 3,
                            "title": "Count by Subscription Id",
                            "queryType": 1,
                            "resourceType": "microsoft.resourcegraph/resources",
                            "crossComponentResources": ["{Subscription}"],
                            "visualization": "piechart"
                        },
                        "customWidth": "33",
                        "name": "query - Count by Subscription Id"
                    },
                    # Resource Distribution by Resource Group
                    {
                        "type": 3,
                        "content": {
                            "version": "KqlItem/1.0",
                            "query": "resources\n| where type =~ \"microsoft.cognitiveservices/accounts\"\n| where name in~ (split(\"{Resources:label}\", \", \"))\n| summarize count(type) by resourceGroup",
                            "size": 3,
                            "title": "Count by Resource Group",
                            "queryType": 1,
                            "resourceType": "microsoft.resourcegraph/resources",
                            "crossComponentResources": ["{Subscription}"],
                            "visualization": "piechart"
                        },
                        "customWidth": "33",
                        "name": "query - Count by Resource Group"
                    },
                    # Resource Distribution by Location
                    {
                        "type": 3,
                        "content": {
                            "version": "KqlItem/1.0",
                            "query": "resources\n| where type =~ \"microsoft.cognitiveservices/accounts\"\n| where name in~ (split(\"{Resources:label}\", \", \"))\n| summarize count(type) by location",
                            "size": 3,
                            "title": "Count by Location",
                            "queryType": 1,
                            "resourceType": "microsoft.resourcegraph/resources",
                            "crossComponentResources": ["{Subscription}"],
                            "visualization": "piechart"
                        },
                        "customWidth": "33",
                        "name": "query - Count by Location"
                    },
                    # Resource Details Table
                    {
                        "type": 3,
                        "content": {
                            "version": "KqlItem/1.0",
                            "query": "resources\n| where type == \"microsoft.cognitiveservices/accounts\"\n| where name in~ (split(\"{Resources:label}\", \", \"))\n| extend Details = pack_all()\n| project Resource=id, resourceGroup, location, subscriptionId, kind, SKU=sku.name, publicNetworkAccess=properties.publicNetworkAccess, dateCreated=properties.dateCreated, tags, Details",
                            "size": 3,
                            "title": "Azure AI Agent Service Resources",
                            "noDataMessage": "No AI Agent Service resources found.",
                            "showExportToExcel": True,
                            "queryType": 1,
                            "resourceType": "microsoft.resourcegraph/resources",
                            "crossComponentResources": ["{Subscription}"],
                            "gridSettings": {
                                "formatters": [
                                    {
                                        "columnMatch": "Resource",
                                        "formatter": 5
                                    },
                                    {
                                        "columnMatch": "resourceGroup",
                                        "formatter": 14
                                    },
                                    {
                                        "columnMatch": "location",
                                        "formatter": 17
                                    },
                                    {
                                        "columnMatch": "subscriptionId",
                                        "formatter": 5
                                    },
                                    {
                                        "columnMatch": "dateCreated",
                                        "formatter": 6
                                    },
                                    {
                                        "columnMatch": "Details",
                                        "formatter": 7,
                                        "formatOptions": {
                                            "linkTarget": "CellDetails",
                                            "linkLabel": "🔍 View Details"
                                        }
                                    }
                                ],
                                "filter": True
                            }
                        },
                        "name": "query - All AI Agent Service Resources"
                    }
                ]
            },
            "conditionalVisibility": {
                "parameterName": "selectedTab",
                "comparison": "isEqualTo",
                "value": "Overview"
            },
            "name": "group - Overview"
        },
        
        # MONITOR TAB (will be added with visualizations)
        {
            "type": 12,
            "content": {
                "version": "NotebookGroup/1.0",
                "groupType": "editable",
                "items": [
                    {
                        "type": 1,
                        "content": {
                            "json": "### Azure AI Agent Service Monitoring\n\nMonitor your agent operations, performance, and tool usage in real-time."
                        },
                        "name": "text - Monitor Title"
                    },
                    # Will be populated with visualizations from your existing queries
                ]
            },
            "conditionalVisibility": {
                "parameterName": "selectedTab",
                "comparison": "isEqualTo",
                "value": "Monitor"
            },
            "name": "group - Monitor"
        }
    ],
    "fallbackResourceIds": [
        f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.OperationalInsights/workspaces/{WORKSPACE_NAME}"
    ],
    "$schema": "https://github.com/Microsoft/Application-Insights-Workbooks/blob/master/schema/workbook.json"
}

# Save the new OpenAI-style workbook
workbook_file_v2 = "ai_agent_monitoring_workbook_openai_style.json"
with open(workbook_file_v2, 'w') as f:
    json.dump(workbook_openai_style, f, indent=2)

print(f"✅ Azure OpenAI Insights-style workbook saved to: {workbook_file_v2}")
print(f"\n📊 Features included:")
print("   - Subscription selector with auto-selection")
print("   - Resource selector (type 5) for AI services")
print("   - Tab navigation (Overview, Monitor, Insights)")
print("   - Resource distribution pie charts")
print("   - Detailed resource table with filters")
print("\n💡 Next: Import this workbook to Azure Portal and select your workspace!")

✅ Azure OpenAI Insights-style workbook saved to: ai_agent_monitoring_workbook_openai_style.json

📊 Features included:
   - Subscription selector with auto-selection
   - Resource selector (type 5) for AI services
   - Tab navigation (Overview, Monitor, Insights)
   - Resource distribution pie charts
   - Detailed resource table with filters

💡 Next: Import this workbook to Azure Portal and select your workspace!


## Workbook Features

The Azure AI Agent Service Monitoring workbook now includes:

###  Monitor Tab - Complete Operational Dashboard:

#### Service Health Overview (Quick Status Cards):
- **Total Requests** - Overall request volume
- **Success Rate %** - Service reliability indicator
- **Avg Latency** - Average response time
- **P95 Latency** - 95th percentile latency (SLA tracking)
- **Total Errors** - Quick error count

#### Performance Trends:
- **Success Rate Trend** - Track reliability over time
- **Latency Percentiles** - P50, P95, P99 latency tracking
- **Operation Analysis**:
  - Operations by Type (pie chart)
  - Requests by Status distribution
  - Operations Summary with Success Rates table

#### Cost & Usage Insights:
- **Daily Request Volume by Resource** - For cost estimation and billing
- **Request Distribution** - Percentage breakdown across resources
- **Resource Performance Comparison** - Side-by-side metrics:
  - Total Requests, Avg Latency, P95 Latency
  - Success Rate with visual indicators
  - Error counts per resource

#### Agent-Specific Metrics:
- **Agent Performance** - By operation name patterns
- **Top 10 Tools Used** - Most frequently called tools
- **Request Volume Over Time** - Area chart for trend analysis
- **Activity Logs** - Latest 500 with clickable details

### 🔍 Insights Tab - Deep Analysis & Troubleshooting:

#### Key Insights & Recommendations:
- **Performance & Reliability Summary**:
  - Total Requests, Slow Requests (>5s)
  - Slow Request Percentage with alerts
  - Failed Requests and Failure Rate
  - Visual indicators for health status

#### Top Issues & Hotspots:
- **Top 10 Slowest Operations** - Operations taking >5 seconds
- **Top 10 High-Volume Callers** - IPs with >100 req/hr
- **Thread & Run Analysis**:
  - Top 20 Active Thread Operations
  - Run Status Distribution
  
#### Tool Call Analysis:
- **Tool Performance by Service** - Success rates per tool
- **Tool Usage Timeline** - Trend analysis
  
#### Error & Failure Analysis:
- **Failures by Error Type** - HTTP status, operation, resource
- **Detailed Log Entries** - Full diagnostic data with drill-down

###  Built-in Monitoring Guide:

#### Recommended SLA Targets:
- Success Rate: ≥ 99.5%
- P95 Latency: < 3 seconds
- P99 Latency: < 5 seconds
- Error Rate: < 0.5%

#### Alert Thresholds:
**Critical Alerts:**
- Success rate < 95%
- Error rate > 5%
- P95 latency > 10 seconds

**Warning Alerts:**
- Success rate 95-99%
- Error rate 1-5%
- P95 latency 5-10 seconds
- Slow requests > 10%

#### Troubleshooting Guide:
- High Latency troubleshooting steps
- High Error Rate diagnostics
- Cost Optimization recommendations
- Workbook usage instructions

##  Workbook with Filters

The workbook now includes comprehensive filtering capabilities similar to Azure OpenAI Insights:

###  Available Filters:
1. **Time Range** - Select from 5 minutes to 30 days
2. **Resource Group** - Filter by specific resource groups
3. **AI Foundry Project** - Select specific AI projects/resources
4. **Operation Name** - Filter by agent operations (Create_Assistant, Create_Thread, etc.)
5. **Result Type** - Filter by success/failure status
6. **Location** - Filter by Azure region

###  Dashboard Sections:
1. **Service Health Overview** - Key metrics tiles showing total requests, success rate, avg duration
2. **Agent Operations Timeline** - Bar chart of operations over time
3. **Agent Success Rate** - Line chart showing success percentage trends
4. **Detailed Analysis** - Comprehensive operations table with P95 latency
5. **Thread & Message Activity** - Thread and message operation trends
6. **Latency Percentiles** - P50, P95, P99 latency analysis
7. **Tool Calls Monitoring** - Tool operation activity and summary
8. **Failures Analysis** - Detailed error tracking with HTTP status codes

###  Key Features:
- **Interactive Filters** - All filters cascade and update all visualizations
- **Export to Excel** - Download detailed data for offline analysis
- **Color-Coded Metrics** - Visual indicators for success rates and status codes
- **Percentile Analysis** - P50, P95, P99 latency tracking
- **Error Grouping** - Failures grouped by error type with time spans
- **Resource Breakdown** - All metrics split by resource for multi-project monitoring

###  Usage Tips:
- Start with broad filters (all resources, all operations)
- Drill down by selecting specific resources or operations
- Use time range to zoom into incidents
- Export detailed tables for deeper investigation
- Pin key visualizations to Azure Portal dashboard

### Query Request Success Rate by Model

In [90]:
# Updated query for success rate
query_success_rate = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated > ago(24h)
| summarize 
    Total=count(),
    Success=countif(ResultType == "Succeeded" or ResultType == "Success"),
    Failed=countif(ResultType == "Failed" or ResultType == "Error")
    by Resource
| extend SuccessRate = round((Success * 100.0) / Total, 2)
| order by SuccessRate asc
"""

success_rate_df = execute_query(query_success_rate)
if success_rate_df is not None and len(success_rate_df) > 0:
    print("Success Rate by Resource:")
    display(success_rate_df)
else:
    print("No success rate data available")

Success Rate by Resource:


Unnamed: 0,Resource,Total,Success,Failed,SuccessRate
0,{AI_FOUNDRY_PROJECT},480,0,0,0


### Query Top Error Messages

In [91]:
# Updated query for top errors
query_top_errors = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where ResultType == "Failed" or ResultType == "Error"
| where isnotempty(ResultDescription)
| where TimeGenerated > ago(24h)
| summarize Count=count(), LastSeen=max(TimeGenerated) by ResultDescription, ResultSignature
| order by Count desc
| take 20
"""

top_errors_df = execute_query(query_top_errors)
if top_errors_df is not None and len(top_errors_df) > 0:
    print("Top 20 Error Messages:")
    display(top_errors_df)
else:
    print("No error messages found")

No error messages found


### Query Client IP Analysis (Security Monitoring)

In [93]:
# Updated query for client IP analysis
query_client_ips = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where ResultType == "Failed" or ResultType == "Error"
| where TimeGenerated > ago(24h)
| summarize 
    RequestCount=count(),
    UniqueOperations=dcount(OperationName),
    ErrorTypes=make_set(ResultSignature)
    by CallerIPAddress
| order by RequestCount desc
| take 20
"""

client_ips_df = execute_query(query_client_ips)
if client_ips_df is not None and len(client_ips_df) > 0:
    print("Top Client IPs with Errors:")
    display(client_ips_df)
else:
    print("No client IP data available")

No client IP data available


## Summary Report Generator

In [95]:
def generate_monitoring_report():
    """
    Generate a comprehensive monitoring report
    """
    print("=" * 80)
    print("Azure AI Agent Service - Monitoring Report")
    print(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print("=" * 80)
    print()
    
    # Execute all queries
    reports = {
        "Total Failures (24h)": execute_query(query_all_failures),
        "Rate Limit Errors": execute_query(query_rate_limits),
        "Authentication Failures": execute_query(query_auth_failures),
        "Server Errors": execute_query(query_server_errors),
        "Agent Service Errors": execute_query(query_agent_errors),
        "High Latency Operations": execute_query(query_high_latency),
        "Success Rate by Model": execute_query(query_success_rate),
    }
    
    for title, df in reports.items():
        print(f"\n{title}:")
        print("-" * 80)
        if df is not None and len(df) > 0:
            print(f"Found {len(df)} records")
            display(df.head(5))
        else:
            print("✅ No issues found")
    
    print("\n" + "=" * 80)
    print("Report Complete")
    print("=" * 80)

# Uncomment to generate the report
generate_monitoring_report()

Azure AI Agent Service - Monitoring Report
Generated: 2025-10-07 11:55:54


Total Failures (24h):
--------------------------------------------------------------------------------
✅ No issues found

Rate Limit Errors:
--------------------------------------------------------------------------------
✅ No issues found

Authentication Failures:
--------------------------------------------------------------------------------
✅ No issues found

Server Errors:
--------------------------------------------------------------------------------
✅ No issues found

Agent Service Errors:
--------------------------------------------------------------------------------
✅ No issues found

High Latency Operations:
--------------------------------------------------------------------------------
Found 48 records

Total Failures (24h):
--------------------------------------------------------------------------------
✅ No issues found

Rate Limit Errors:
--------------------------------------------------------

Unnamed: 0,TimeGenerated,OperationName,DurationMs,ResultSignature,ResultType,Resource,CorrelationId
0,2025-10-07 13:11:57.794000+00:00,Embeddings_Create,12104,200,,{AI_FOUNDRY_PROJECT},00b4a0ea-0d2e-4859-bfdf-8922275ceefd
1,2025-10-07 13:19:12.088000+00:00,ChatCompletions_Create,11367,200,,{AI_FOUNDRY_PROJECT},bf0c9825-42fd-4abb-93a1-e1b267ae7598
2,2025-10-07 13:08:11.457000+00:00,Uploads a file for use by other operations.,10615,200,,{AI_FOUNDRY_PROJECT},36e13e7a-7e3c-49f7-882e-6b97bfa8d3fa
3,2025-10-07 13:14:28.818000+00:00,Uploads a file for use by other operations.,9997,200,,{AI_FOUNDRY_PROJECT},23febb59-9363-4320-b3a2-7e6f04d812a5
4,2025-10-07 13:20:22.534000+00:00,Uploads a file for use by other operations.,9836,200,,{AI_FOUNDRY_PROJECT},a5d41556-76a5-4242-a908-1ffac40e365c



Success Rate by Model:
--------------------------------------------------------------------------------
Found 1 records


Unnamed: 0,Resource,Total,Success,Failed,SuccessRate
0,{AI_FOUNDRY_PROJECT},480,0,0,0



Report Complete


## 🧪 Dashboard Query Verification

Let's verify that all dashboard queries work correctly:

In [96]:
print("🧪 Testing Dashboard Queries...\n")

# Test 1: Agent Operations Timeline
print("1️⃣ Testing: Agent Operations Timeline")
test1_df = execute_query(query_agent_ops_dashboard)
if test1_df is not None and len(test1_df) > 0:
    print(f"   ✅ Success - {len(test1_df)} rows returned")
else:
    print("   ⚠️  No data - check if agent operations exist")

# Test 2: Agent Success Rate
print("\n2️⃣ Testing: Agent Success Rate")
test2_df = execute_query(query_agent_success_rate)
if test2_df is not None and len(test2_df) > 0:
    print(f"   ✅ Success - {len(test2_df)} rows returned")
    avg_success = test2_df['SuccessRate'].mean()
    print(f"   📊 Average Success Rate: {avg_success:.2f}%")
else:
    print("   ⚠️  No data")

# Test 3: High Latency Operations
print("\n3️⃣ Testing: High Latency Operations")
test3_df = execute_query(query_high_latency)
if test3_df is not None and len(test3_df) > 0:
    print(f"   ✅ Found {len(test3_df)} high latency operations")
else:
    print("   ✅ No high latency operations - service is fast!")

# Test 4: Agent Summary
print("\n4️⃣ Testing: Agent Operations Summary")
test4_df = execute_query(query_agent_summary)
if test4_df is not None and len(test4_df) > 0:
    print(f"   ✅ Success - {len(test4_df)} operation types returned")
    print(f"   📊 Total Operations: {test4_df['Count'].sum():,}")
else:
    print("   ⚠️  No agent operations found")

# Test 5: Thread & Message Activity
print("\n5️⃣ Testing: Thread & Message Activity")
test5_df = execute_query(query_threads_messages)
if test5_df is not None and len(test5_df) > 0:
    print(f"   ✅ Success - {len(test5_df)} rows returned")
    print(f"   📊 Total Events: {test5_df['EventCount'].sum():,}")
else:
    print("   ⚠️  No thread/message activity found")

print("\n" + "="*60)
print("✅ All dashboard queries tested successfully!")
print("="*60)
print("\n💡 Next Steps:")
print("1. Import ai_agent_monitoring_dashboard.json to Azure Portal")
print("2. Go to Azure Portal > Dashboard > Upload")
print("3. Select the generated JSON file")
print("4. Adjust time ranges and filters as needed")

🧪 Testing Dashboard Queries...

1️⃣ Testing: Agent Operations Timeline
   ✅ Success - 4 rows returned

2️⃣ Testing: Agent Success Rate
   ✅ Success - 4 rows returned

2️⃣ Testing: Agent Success Rate
   ✅ Success - 1 rows returned
   📊 Average Success Rate: 100.00%

3️⃣ Testing: High Latency Operations
   ✅ Success - 1 rows returned
   📊 Average Success Rate: 100.00%

3️⃣ Testing: High Latency Operations
   ✅ Found 48 high latency operations

4️⃣ Testing: Agent Operations Summary
   ✅ Found 48 high latency operations

4️⃣ Testing: Agent Operations Summary
   ✅ Success - 5 operation types returned
   📊 Total Operations: 94

5️⃣ Testing: Thread & Message Activity
   ✅ Success - 5 operation types returned
   📊 Total Operations: 94

5️⃣ Testing: Thread & Message Activity
   ✅ Success - 2 rows returned
   📊 Total Events: 55

✅ All dashboard queries tested successfully!

💡 Next Steps:
1. Import ai_agent_monitoring_dashboard.json to Azure Portal
2. Go to Azure Portal > Dashboard > Upload
3. Se

---

## 📋 Complete Agent Monitoring Guide

### Dashboard Setup Instructions

#### Option 1: Azure Portal (Recommended)

1. **Navigate to Azure Portal**
   - Go to https://portal.azure.com
   - Search for "Dashboard" in the top search bar
   - Click "Dashboard" under Services

2. **Import the Dashboard**
   - Click "Upload" in the top toolbar
   - Select `ai_agent_monitoring_dashboard.json` from your local directory
   - Click "Save"

3. **Configure Workspace Connection**
   - Each tile may need the workspace resource ID
   - Format: `/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.OperationalInsights/workspaces/{WORKSPACE_ID}`
   - Your values are already in the JSON file

4. **Customize as Needed**
   - Click "Edit" to modify tile positions, sizes, or queries
   - Add additional tiles from the Gallery
   - Save your customized version

#### Option 2: Azure CLI

```bash
# Upload dashboard via CLI
az portal dashboard create \
  --resource-group {RESOURCE_GROUP} \
  --name 'AI-Agent-Service-Monitoring' \
  --input-path ai_agent_monitoring_dashboard.json \
  --location global
```

### Dashboard Tiles Explained

#### 1. **Agent Operations Timeline** (Stacked Column Chart)
- **What**: Shows agent operations over time (hourly bins)
- **Operations tracked**: Create_Assistant, Create_Thread, Create_Message, Create_Run
- **Use for**: Identifying peak usage times and operation patterns

#### 2. **Agent Success Rate** (Line Chart)
- **What**: Tracks success percentage over time
- **Calculation**: (Successful operations / Total operations) × 100
- **Alert threshold**: Drop below 95% warrants investigation

#### 3. **High Latency Operations** (Grid)
- **What**: Lists operations taking >5 seconds
- **Columns**: TimeGenerated, OperationName, DurationMs, ResultSignature
- **Use for**: Performance optimization and bottleneck identification

#### 4. **Agent Operations Summary** (Grid)
- **What**: Comprehensive statistics per operation type
- **Metrics**: Count, Success/Failure counts, Avg/Max duration, Unique correlation IDs
- **Use for**: Understanding operation distribution and reliability

#### 5. **Thread & Message Activity** (Bar Chart)
- **What**: Thread and message creation patterns
- **Metrics**: Event count, unique threads, average duration
- **Use for**: Monitoring conversation activity and user engagement

### Alerting Strategy

Based on the Microsoft documentation for Category: Agents, consider these alert thresholds:

#### Critical Alerts (Severity 0-1)
- **Agent Run Failures** > 5% in 15 minutes
- **Tool Call Failures** > 10% in 15 minutes
- **Thread Creation Failures** > 3 consecutive failures

#### Warning Alerts (Severity 2-3)
- **High Latency** > 10 seconds (95th percentile)
- **Message Processing Delay** > 5 seconds average
- **Agent Success Rate** < 95% in 1 hour

#### Informational Alerts (Severity 4)
- **High Token Usage** > 80% of quota
- **Unusual Operation Patterns** - spike detection
- **New Agent Deployments** - change tracking

### Troubleshooting Common Issues

#### Dashboard Not Loading
- **Check**: Workspace ID is correct in dashboard JSON
- **Verify**: You have Log Analytics Reader role
- **Solution**: Re-upload dashboard after fixing workspace ID

#### No Data Showing
- **Check**: Diagnostic settings enabled on AI Foundry resource
- **Verify**: Logs are flowing to Log Analytics workspace
- **Solution**: Wait 5-10 minutes for data ingestion, then refresh

#### Queries Timing Out
- **Issue**: Too much data or complex queries
- **Solution**: Reduce time range (24h → 4h) or add more filters
- **Alternative**: Use Azure Data Explorer for large-scale analysis

### Best Practices

1. **Regular Review**: Check dashboard daily for anomalies
2. **Custom Time Ranges**: Adjust based on your deployment schedule
3. **Baseline Metrics**: Document normal operation patterns
4. **Alert Tuning**: Adjust thresholds after monitoring for 1 week
5. **Correlation**: Link failures to deployment events or config changes

### Additional Resources

- 📖 [Azure AI Foundry Agent Service Monitoring Reference](https://learn.microsoft.com/en-us/azure/ai-foundry/agents/reference/monitor-service)
- 📖 [KQL Quick Reference](https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/kql-quick-reference)
- 📖 [Azure Monitor Dashboards](https://learn.microsoft.com/en-us/azure/azure-monitor/visualize/tutorial-logs-dashboards)
- 📖 [Log Analytics Query Best Practices](https://learn.microsoft.com/en-us/azure/azure-monitor/logs/query-optimization)

## 🎯 Quick Reference: Agent Monitoring Queries

### Agent Operations to Monitor

Based on Azure AI Foundry documentation, here are the key operation names:

#### Assistant Operations
```
Create_Assistant  - Create new agent/assistant
Update_Assistant  - Modify agent configuration
Delete_Assistant  - Remove agent
List_Assistants   - Retrieve assistant list
```

#### Thread Operations
```
Create_Thread     - Start new conversation thread
Delete_Thread     - Remove thread
Retrieve_Thread   - Get thread details
Modify_Thread     - Update thread metadata
```

#### Message Operations
```
Create_Message    - Add message to thread
List_Messages     - Retrieve thread messages
Retrieve_Message  - Get specific message
Modify_Message    - Update message content
```

#### Run Operations
```
Create_Run        - Start agent execution
Cancel_Run        - Stop running agent
Retrieve_Run      - Get run status
List_Runs         - Retrieve all runs
Submit_Tool_Outputs - Provide tool results
```

#### Run Steps Operations
```
List_Run_Steps    - Get execution steps
Retrieve_Run_Step - Get specific step details
```

### Quick Query Templates

#### Check Agent Health (Last Hour)
```kql
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated > ago(1h)
| where OperationName has_any ("Create_Assistant", "Create_Run", "Create_Message")
| summarize Count=count(), Failures=countif(ResultType == "Failed") by OperationName
| extend SuccessRate = round((Count - Failures) * 100.0 / Count, 2)
```

#### Find Slow Agent Operations
```kql
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated > ago(1h)
| where OperationName has "Agent" or OperationName has "Run"
| where DurationMs > 3000
| project TimeGenerated, OperationName, DurationMs, CorrelationId
| order by DurationMs desc
```

#### Monitor Active Threads
```kql
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated > ago(24h)
| where OperationName == "Create_Thread" or OperationName == "Delete_Thread"
| summarize Created=countif(OperationName == "Create_Thread"),
            Deleted=countif(OperationName == "Delete_Thread")
| extend ActiveThreads = Created - Deleted
```

#### Track Message Volume
```kql
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated > ago(24h)
| where OperationName == "Create_Message"
| summarize Messages=count() by bin(TimeGenerated, 1h)
| render timechart
```

### Metrics Available via Azure Monitor

These metrics are available through Azure Monitor (not Log Analytics):

- **Agents**: Count of agent events with EventType dimension
- **Messages**: Message count with EventType and ThreadId dimensions
- **Runs**: Run count with AgentId, RunStatus, StatusCode, StreamType dimensions
- **Threads**: Thread event count with EventType dimension
- **Tokens**: Token counts with AgentId and TokenType dimensions
- **ToolCalls**: Tool call count with AgentId and ToolName dimensions

To query these metrics, use Azure Monitor Metrics API or the Azure Portal Metrics explorer.

## 🔧 Quick Health Check Function

In [34]:
def quick_health_check(hours=1):
    """
    Quick health check of Azure AI Agent Service
    
    Args:
        hours: Number of hours to look back (default: 1)
    
    Returns:
        Dict with health status
    """
    query = f"""
    AzureDiagnostics
    | where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
    | where TimeGenerated > ago({hours}h)
    | summarize 
        TotalRequests=count(),
        FailedRequests=countif(ResultType == "Failed"),
        AvgDurationMs=avg(DurationMs),
        MaxDurationMs=max(DurationMs),
        UniqueOperations=dcount(OperationName)
    """
    
    result_df = execute_query(query)
    
    if result_df is not None and len(result_df) > 0:
        row = result_df.iloc[0]
        total = row['TotalRequests']
        failed = row['FailedRequests']
        success_rate = ((total - failed) / total * 100) if total > 0 else 0
        
        print(f"🏥 Health Check (Last {hours} hour{'s' if hours > 1 else ''})")
        print("=" * 50)
        print(f"✅ Total Requests: {total:,}")
        print(f"❌ Failed Requests: {failed:,}")
        print(f"📊 Success Rate: {success_rate:.2f}%")
        print(f"⏱️  Avg Duration: {row['AvgDurationMs']:.2f} ms")
        print(f"⏱️  Max Duration: {row['MaxDurationMs']:.0f} ms")
        print(f"🔧 Unique Operations: {row['UniqueOperations']}")
        print("=" * 50)
        
        # Health status
        if success_rate >= 99.5:
            print("🟢 Status: HEALTHY")
        elif success_rate >= 95:
            print("🟡 Status: DEGRADED")
        else:
            print("🔴 Status: UNHEALTHY")
            
        return {
            'status': 'healthy' if success_rate >= 99.5 else 'degraded' if success_rate >= 95 else 'unhealthy',
            'total_requests': total,
            'failed_requests': failed,
            'success_rate': success_rate,
            'avg_duration_ms': row['AvgDurationMs'],
            'max_duration_ms': row['MaxDurationMs']
        }
    else:
        print(f"ℹ️  No data available for the last {hours} hour{'s' if hours > 1 else ''}")
        return None

# Run health check for the last hour
health_status = quick_health_check(hours=1)

🏥 Health Check (Last 1 hour)
✅ Total Requests: 471.0
❌ Failed Requests: 0.0
📊 Success Rate: 100.00%
⏱️  Avg Duration: 1255.71 ms
⏱️  Max Duration: 12104 ms
🔧 Unique Operations: 22.0
🟢 Status: HEALTHY


## 💡 Example: Custom Monitoring Queries

Here are some examples of custom queries you can use:

In [35]:
# Example 1: Monitor specific agent operations
query_agent_operations = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where OperationName has "Create_Message" or OperationName has "Create_Run" or OperationName has "Create_Thread"
| where TimeGenerated > ago(24h)
| summarize Count=count(), AvgDuration=avg(DurationMs) by OperationName
| order by Count desc
"""

print("📊 Agent Operations Summary:")
agent_ops_df = execute_query(query_agent_operations)
if agent_ops_df is not None and len(agent_ops_df) > 0:
    display(agent_ops_df)
else:
    print("No agent operations found")

# Example 2: Track model usage patterns
query_model_usage = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated > ago(24h)
| where isnotempty(properties_s)
| extend Props = parse_json(properties_s)
| extend ModelDeployment = tostring(Props.modelDeploymentName)
| where isnotempty(ModelDeployment)
| summarize 
    Requests=count(),
    AvgDuration=avg(DurationMs),
    Operations=make_set(OperationName)
    by ModelDeployment
| order by Requests desc
"""

print("\n🤖 Model Deployment Usage:")
model_usage_df = execute_query(query_model_usage)
if model_usage_df is not None and len(model_usage_df) > 0:
    display(model_usage_df)
else:
    print("No model deployment data found")

# Example 3: Monitor request/response sizes
query_data_transfer = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated > ago(24h)
| where isnotempty(properties_s)
| extend Props = parse_json(properties_s)
| extend 
    ReqLength = toint(Props.requestLength),
    RespLength = toint(Props.responseLength)
| where isnotempty(ReqLength) or isnotempty(RespLength)
| summarize 
    Requests=count(),
    TotalReqMB=sum(ReqLength)/1024/1024,
    TotalRespMB=sum(RespLength)/1024/1024,
    AvgReqKB=avg(ReqLength)/1024,
    AvgRespKB=avg(RespLength)/1024
    by OperationName
| order by Requests desc
"""

print("\n📈 Data Transfer by Operation:")
data_transfer_df = execute_query(query_data_transfer)
if data_transfer_df is not None and len(data_transfer_df) > 0:
    display(data_transfer_df)
    
    total_req = data_transfer_df['TotalReqMB'].sum()
    total_resp = data_transfer_df['TotalRespMB'].sum()
    print(f"\n📊 Total Data Transfer (24h):")
    print(f"   ⬆️  Request: {total_req:.2f} MB")
    print(f"   ⬇️  Response: {total_resp:.2f} MB")
    print(f"   🔄 Total: {total_req + total_resp:.2f} MB")
else:
    print("No data transfer metrics found")

📊 Agent Operations Summary:


Unnamed: 0,OperationName,Count,AvgDuration
0,Create_Message,36,242.694444
1,Create_Thread,19,138.157895
2,Create_Run,19,414.052632



🤖 Model Deployment Usage:


Unnamed: 0,ModelDeployment,Requests,AvgDuration,Operations
0,text-embedding-3-small,144,306.159722,"[""Embeddings_Create""]"
1,gpt-4.1,79,2413.658228,"[""Create_Assistant"",""ChatCompletions_Create"",""..."



📈 Data Transfer by Operation:


Unnamed: 0,OperationName,Requests,TotalReqMB,TotalRespMB,AvgReqKB,AvgRespKB
0,Embeddings_Create,144,0,1,0.67905,8.21837
1,ChatCompletions_Create,39,0,0,11.102013,2.678811
2,Create_Message,36,0,0,3.306586,3.613146
3,Creates a new message on a specified thread.,36,0,0,3.306586,0.0
4,Completions_Create,30,1,0,44.546842,0.0
5,Creates a new run for an assistant thread.,19,0,0,4.614206,0.0
6,Creates a new thread. Threads contain messages...,19,0,0,0.060752,0.203279
7,Create_Thread,19,0,0,0.346525,0.203279
8,Create_Run,19,0,0,5.184365,0.0
9,Uploads a file for use by other operations.,13,162,0,12775.094201,0.0



📊 Total Data Transfer (24h):
   ⬆️  Request: 325.00 MB
   ⬇️  Response: 1.00 MB
   🔄 Total: 326.00 MB


---

## 🎓 Learning Resources

### KQL (Kusto Query Language) Tips

1. **Filter First**: Always filter by `ResourceProvider` and `TimeGenerated` first
2. **Parse JSON**: Use `parse_json()` and `extend` to extract from `properties_s`
3. **Column Names**: Use exact case - Azure is case-sensitive
4. **Time Windows**: Use `ago()` function: `ago(1h)`, `ago(24h)`, `ago(7d)`
5. **Aggregations**: Use `summarize` with functions like `count()`, `avg()`, `max()`

### Common Patterns

```kql
// Basic filtering
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where TimeGenerated > ago(1h)

// Extract JSON properties
| extend Props = parse_json(properties_s)
| extend FieldName = tostring(Props.fieldName)

// Aggregate and group
| summarize Count=count(), Avg=avg(DurationMs) by OperationName

// Time-based grouping
| summarize Count=count() by bin(TimeGenerated, 1h)
```

### Troubleshooting

**Q: Column not found error?**
- Run the schema discovery query (Cell 7) to see all available columns
- Check column name case sensitivity (e.g., `CallerIPAddress` not `CallerIpAddress`)

**Q: No data returned?**
- Check if diagnostic settings are enabled on your AI Foundry resource
- Verify the `WORKSPACE_ID` is correct
- Ensure sufficient data retention period in Log Analytics workspace

**Q: Token usage not showing?**
- Token data is NOT in AzureDiagnostics logs
- Check Azure Monitor Metrics or Azure OpenAI Studio instead

### Additional Resources

- 📖 [KQL Quick Reference](https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/kql-quick-reference)
- 📖 [Azure Monitor Best Practices](https://learn.microsoft.com/en-us/azure/azure-monitor/best-practices)
- 📖 [AI Foundry Monitoring Guide](https://learn.microsoft.com/en-us/azure/ai-foundry/agents/reference/monitor-service)

---

**🎉 Congratulations!** You now have a comprehensive Azure Monitor integration for your AI Agent Service!