# Lab 6: Observability & Monitoring

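Welcome to Lab 6! In this lab, we'll add observability to our production SAP agent. Working with simulated metrics, we'll build performance analytics, error tracking with alert thresholds, usage analytics, a dashboard configuration, and a troubleshooting guide, so the agent can be monitored once it runs in production.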

## 🎯 Learning Objectives

By the end of this lab, you will:
- Understand AgentCore Observability concepts
- Set up CloudWatch monitoring and logging
- Create custom metrics and dashboards
- Configure alerting and notifications
- Implement performance tracking
- Learn about troubleshooting and debugging

## ⏱️ Estimated Time: 20 minutes

## Architecture for Lab 6

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Interfaces    │───▶│ AgentCore       │───▶│ SAP Agent       │
│                 │    │ Runtime         │    │                 │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │
                                ▼
                       ┌─────────────────┐
                       │ Observability   │
                       │ • CloudWatch    │
                       │ • X-Ray         │
                       │ • Dashboards    │
                       │ • Alerts        │
                       └─────────────────┘
```

In [None]:
# Import required libraries
import sys
import os
import json
import boto3
from typing import Dict, Any, List, Optional
from datetime import datetime, timedelta
import random

from utils import (
    print_header, print_success, print_error, print_info, print_warning,
    check_aws_credentials, workshop_progress
)

# Display lab header
print_header("Lab 6: Observability & Monitoring")
print_info("Setting up comprehensive monitoring for the production SAP agent")

## Step 1: CloudWatch Metrics Dashboard

Let's generate sample metrics that stand in for the data a CloudWatch dashboard would display for our agent.

In [None]:
# Generate sample metrics to feed the CloudWatch dashboard
print_header("Creating CloudWatch Metrics Dashboard", level=2)

# Generate sample metrics data
def generate_sample_metrics():
    """Generate sample metrics for demonstration."""
    now = datetime.now()
    metrics = []
    
    # Generate hourly data for the last 24 hours
    for i in range(24):
        timestamp = now - timedelta(hours=i)
        
        metrics.append({
            "timestamp": timestamp.isoformat(),
            "requests_per_hour": random.randint(100, 500),
            "avg_response_time_ms": random.randint(800, 2000),
            "success_rate_percent": round(random.uniform(98.0, 99.9), 2),
            "error_count": random.randint(0, 5),
            "active_sessions": random.randint(20, 100),
            "sap_api_calls": random.randint(50, 200),
            "email_notifications": random.randint(5, 25),
            "knowledge_base_queries": random.randint(10, 40)
        })
    
    return metrics

# Generate sample data
sample_metrics = generate_sample_metrics()

print_success("Sample metrics data generated for dashboard")
print_info(f"Generated {len(sample_metrics)} data points for the last 24 hours")

# Display current metrics summary
latest_metrics = sample_metrics[0]
print("\n📊 **Current Metrics (Last Hour):**")
print(f"   - Requests: {latest_metrics['requests_per_hour']}/hour")
print(f"   - Avg Response Time: {latest_metrics['avg_response_time_ms']}ms")
print(f"   - Success Rate: {latest_metrics['success_rate_percent']}%")
print(f"   - Active Sessions: {latest_metrics['active_sessions']}")
print(f"   - SAP API Calls: {latest_metrics['sap_api_calls']}/hour")
print(f"   - Email Notifications: {latest_metrics['email_notifications']}/hour")
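
The sample data above lives only in the notebook. For it to appear in CloudWatch, each data point would have to be published as a custom metric. The following is a minimal sketch of that step, assuming boto3's `put_metric_data` call; the namespace `SAPAgent/Workshop` and the CloudWatch metric names are hypothetical placeholders, not names defined by AgentCore.

```python
from datetime import datetime, timezone

# Hypothetical namespace for this workshop; choose your own in practice.
NAMESPACE = "SAPAgent/Workshop"

# Map sample-metric keys to (CloudWatch metric name, unit). Names are illustrative.
METRIC_MAP = {
    "requests_per_hour": ("RequestsPerHour", "Count"),
    "avg_response_time_ms": ("AverageResponseTime", "Milliseconds"),
    "success_rate_percent": ("SuccessRate", "Percent"),
    "error_count": ("ErrorCount", "Count"),
}


def to_metric_data(point: dict) -> list:
    """Convert one sample data point into a CloudWatch MetricData list."""
    timestamp = datetime.fromisoformat(point["timestamp"])
    return [
        {
            "MetricName": name,
            "Timestamp": timestamp,
            "Value": float(point[key]),
            "Unit": unit,
        }
        for key, (name, unit) in METRIC_MAP.items()
        if key in point
    ]


def publish(point: dict, region: str = "us-east-1") -> None:
    """Send one data point to CloudWatch (requires AWS credentials)."""
    import boto3  # imported here so the payload builder works without AWS

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_data(Namespace=NAMESPACE, MetricData=to_metric_data(point))


# A hand-written data point with the same keys as the generated samples
example_point = {
    "timestamp": datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc).isoformat(),
    "requests_per_hour": 250,
    "avg_response_time_ms": 1200,
    "success_rate_percent": 99.4,
    "error_count": 2,
}
payload = to_metric_data(example_point)
print([m["MetricName"] for m in payload])
```

`publish` is deliberately not called here: it needs valid AWS credentials, and custom metrics are billable.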

## Step 2: Performance Analytics

Let's analyze the performance trends and identify optimization opportunities.

In [None]:
# Analyze performance trends
print_header("Performance Analytics", level=2)

# Calculate performance statistics
total_requests = sum(m['requests_per_hour'] for m in sample_metrics)
avg_response_time = sum(m['avg_response_time_ms'] for m in sample_metrics) / len(sample_metrics)
avg_success_rate = sum(m['success_rate_percent'] for m in sample_metrics) / len(sample_metrics)
total_errors = sum(m['error_count'] for m in sample_metrics)
peak_sessions = max(m['active_sessions'] for m in sample_metrics)
total_sap_calls = sum(m['sap_api_calls'] for m in sample_metrics)

print_info("📈 **24-Hour Performance Summary:**")
print(f"\n🚀 **Volume Metrics:**")
print(f"   - Total Requests: {total_requests:,}")
print(f"   - Peak Concurrent Sessions: {peak_sessions}")
print(f"   - Total SAP API Calls: {total_sap_calls:,}")
print(f"   - Average Requests/Hour: {total_requests // len(sample_metrics):,}")

print(f"\n⚡ **Performance Metrics:**")
print(f"   - Average Response Time: {avg_response_time:.0f}ms")
print(f"   - Success Rate: {avg_success_rate:.2f}%")
print(f"   - Total Errors: {total_errors}")
print(f"   - Error Rate: {(total_errors/total_requests)*100:.3f}%")

# Performance assessment
print(f"\n🎯 **Performance Assessment:**")
if avg_response_time < 1000:
    print("   ✅ Response time: Excellent (< 1s)")
elif avg_response_time < 2000:
    print("   ⚠️  Response time: Good (1-2s)")
else:
    print("   ❌ Response time: Needs improvement (> 2s)")

if avg_success_rate > 99.5:
    print("   ✅ Success rate: Excellent (> 99.5%)")
elif avg_success_rate > 99.0:
    print("   ⚠️  Success rate: Good (> 99%)")
else:
    print("   ❌ Success rate: Needs improvement (< 99%)")

if (total_errors/total_requests)*100 < 0.1:
    print("   ✅ Error rate: Excellent (< 0.1%)")
elif (total_errors/total_requests)*100 < 0.5:
    print("   ⚠️  Error rate: Acceptable (< 0.5%)")
else:
    print("   ❌ Error rate: Needs attention (> 0.5%)")

print_success("Performance analysis completed")
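
One caveat about the averages above: they weight every hour equally, so a quiet hour counts as much as a busy one. A request-weighted mean is usually closer to what users actually experienced. The sketch below contrasts the two on a two-hour toy data set; the helper name `weighted_mean` is illustrative and not part of the workshop utilities.

```python
def weighted_mean(points: list, value_key: str, weight_key: str) -> float:
    """Mean of `value_key` across points, weighted by `weight_key`."""
    total_weight = sum(p[weight_key] for p in points)
    if total_weight == 0:
        raise ValueError("total weight is zero; cannot compute a weighted mean")
    return sum(p[value_key] * p[weight_key] for p in points) / total_weight


# Toy data: a quiet, fast hour and a busy, slow hour.
hours = [
    {"requests_per_hour": 100, "avg_response_time_ms": 1000},
    {"requests_per_hour": 300, "avg_response_time_ms": 2000},
]

simple = sum(h["avg_response_time_ms"] for h in hours) / len(hours)
weighted = weighted_mean(hours, "avg_response_time_ms", "requests_per_hour")

print(f"Simple mean:   {simple:.0f}ms")    # 1500ms
print(f"Weighted mean: {weighted:.0f}ms")  # 1750ms
```

The same helper could be applied to `sample_metrics` with `success_rate_percent` as the value key.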

## Step 3: Error Tracking & Alerting

Let's categorize simulated errors, define alert thresholds, and check the current metrics against those thresholds.

In [None]:
# Error tracking and alerting setup
print_header("Error Tracking & Alerting", level=2)

# Simulate error categories and their frequencies
error_categories = {
    "SAP_CONNECTION_TIMEOUT": random.randint(0, 3),
    "INVALID_ORDER_ID": random.randint(0, 5),
    "EMAIL_DELIVERY_FAILED": random.randint(0, 2),
    "KNOWLEDGE_BASE_UNAVAILABLE": random.randint(0, 1),
    "MEMORY_SERVICE_ERROR": random.randint(0, 2),
    "AUTHENTICATION_FAILED": random.randint(0, 1),
    "RATE_LIMIT_EXCEEDED": random.randint(0, 2)
}

print_info("🚨 **Error Categories (Last 24 Hours):**")
total_categorized_errors = 0
for category, count in error_categories.items():
    if count > 0:
        severity = "🔴 HIGH" if count > 3 else "🟡 MEDIUM" if count > 1 else "🟢 LOW"
        print(f"   - {category}: {count} occurrences ({severity})")
        total_categorized_errors += count

if total_categorized_errors == 0:
    print("   ✅ No errors detected in the last 24 hours")

# Alert configuration
alert_config = {
    "response_time_threshold_ms": 3000,
    "error_rate_threshold_percent": 1.0,
    "success_rate_threshold_percent": 98.0,
    "sap_connection_timeout_threshold": 5,
    "notification_channels": [
        "email: ops-team@company.com",
        "slack: #sap-agent-alerts",
        "sns: arn:aws:sns:us-east-1:123456789012:sap-agent-alerts"
    ]
}

print(f"\n🔔 **Alert Configuration:**")
print(f"   - Response Time Alert: > {alert_config['response_time_threshold_ms']}ms")
print(f"   - Error Rate Alert: > {alert_config['error_rate_threshold_percent']}%")
print(f"   - Success Rate Alert: < {alert_config['success_rate_threshold_percent']}%")
print(f"   - SAP Connection Alert: > {alert_config['sap_connection_timeout_threshold']} timeouts/hour")

print(f"\n📢 **Notification Channels:**")
for channel in alert_config['notification_channels']:
    print(f"   - {channel}")

# Check current status against thresholds
print(f"\n🎯 **Current Alert Status:**")
current_error_rate = (total_errors/total_requests)*100

if avg_response_time > alert_config['response_time_threshold_ms']:
    print(f"   🚨 ALERT: Response time ({avg_response_time:.0f}ms) exceeds threshold")
else:
    print(f"   ✅ Response time within acceptable range")

if current_error_rate > alert_config['error_rate_threshold_percent']:
    print(f"   🚨 ALERT: Error rate ({current_error_rate:.2f}%) exceeds threshold")
else:
    print(f"   ✅ Error rate within acceptable range")

if avg_success_rate < alert_config['success_rate_threshold_percent']:
    print(f"   🚨 ALERT: Success rate ({avg_success_rate:.2f}%) below threshold")
else:
    print(f"   ✅ Success rate within acceptable range")

print_success("Error tracking and alerting configured")
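
The threshold checks above run once, inside the notebook. In production the same thresholds would typically live in CloudWatch alarms that evaluate continuously. Here is a minimal sketch, assuming boto3's `put_metric_alarm` and the SNS topic ARN from `alert_config`; the namespace, metric names, alarm names, and evaluation windows are hypothetical and must match whatever your agent actually publishes.

```python
# Hypothetical namespace; must match the namespace the metrics are published under.
NAMESPACE = "SAPAgent/Workshop"
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:sap-agent-alerts"


def build_alarm(name, metric, threshold, comparison, periods, period_seconds=60):
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": name,
        "Namespace": NAMESPACE,
        "MetricName": metric,
        "Statistic": "Average",
        "Period": period_seconds,
        "EvaluationPeriods": periods,
        "Threshold": threshold,
        "ComparisonOperator": comparison,
        "AlarmActions": [SNS_TOPIC_ARN],
        "TreatMissingData": "notBreaching",
    }


# Evaluation windows (5 and 3 one-minute periods) are illustrative choices.
alarms = [
    build_alarm("sap-agent-high-response-time", "AverageResponseTime",
                3000.0, "GreaterThanThreshold", periods=5),
    build_alarm("sap-agent-low-success-rate", "SuccessRate",
                98.0, "LessThanThreshold", periods=3),
]


def create_alarms(region="us-east-1"):
    """Create the alarms in CloudWatch (requires AWS credentials)."""
    import boto3  # imported lazily so the definitions above work offline

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    for alarm in alarms:
        cloudwatch.put_metric_alarm(**alarm)


for alarm in alarms:
    print(f"{alarm['AlarmName']}: {alarm['MetricName']} "
          f"{alarm['ComparisonOperator']} {alarm['Threshold']}")
```

Keeping the alarm definitions as plain dictionaries makes them easy to review and test before `create_alarms()` touches an AWS account.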

## Step 4: Usage Analytics

Let's analyze usage patterns and user behavior.

In [None]:
# Usage analytics
print_header("Usage Analytics", level=2)

# Simulate usage patterns
usage_patterns = {
    "top_queries": [
        {"query": "Show blocked orders", "count": random.randint(800, 1200), "avg_response_time": random.randint(900, 1100)},
        {"query": "Order details SO*", "count": random.randint(600, 900), "avg_response_time": random.randint(1000, 1300)},
        {"query": "Remove delivery block", "count": random.randint(300, 500), "avg_response_time": random.randint(1200, 1500)},
        {"query": "Send email notification", "count": random.randint(200, 400), "avg_response_time": random.randint(800, 1000)},
        {"query": "Troubleshooting help", "count": random.randint(150, 300), "avg_response_time": random.randint(1500, 2000)}
    ],
    "user_segments": {
        "sales_managers": {"sessions": random.randint(200, 300), "avg_session_duration": "8.5 minutes"},
        "customer_service": {"sessions": random.randint(400, 600), "avg_session_duration": "12.3 minutes"},
        "operations_team": {"sessions": random.randint(150, 250), "avg_session_duration": "15.7 minutes"},
        "system_integrations": {"sessions": random.randint(100, 200), "avg_session_duration": "2.1 minutes"}
    },
    "peak_hours": [
        {"hour": "09:00-10:00", "requests": random.randint(400, 600)},
        {"hour": "10:00-11:00", "requests": random.randint(450, 650)},
        {"hour": "14:00-15:00", "requests": random.randint(380, 580)},
        {"hour": "15:00-16:00", "requests": random.randint(420, 620)}
    ]
}

print_info("📊 **Usage Analytics (Last 24 Hours):**")

print(f"\n🔥 **Top Queries:**")
for i, query in enumerate(usage_patterns['top_queries'], 1):
    print(f"   {i}. {query['query']}")
    print(f"      - Count: {query['count']:,} requests")
    print(f"      - Avg Response: {query['avg_response_time']}ms")

print(f"\n👥 **User Segments:**")
total_user_sessions = 0
for segment, data in usage_patterns['user_segments'].items():
    print(f"   - {segment.replace('_', ' ').title()}:")
    print(f"     • Sessions: {data['sessions']:,}")
    print(f"     • Avg Duration: {data['avg_session_duration']}")
    total_user_sessions += data['sessions']

print(f"\n⏰ **Peak Usage Hours:**")
for peak in usage_patterns['peak_hours']:
    print(f"   - {peak['hour']}: {peak['requests']:,} requests")

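# Usage insights, derived from the simulated data above
top_query = max(usage_patterns['top_queries'], key=lambda q: q['count'])
top_query_share = top_query['count'] / sum(q['count'] for q in usage_patterns['top_queries']) * 100
primary_segment = max(
    usage_patterns['user_segments'],
    key=lambda name: usage_patterns['user_segments'][name]['sessions']
)
busiest_hour = max(usage_patterns['peak_hours'], key=lambda p: p['requests'])
# Durations are stored as strings such as "8.5 minutes"; weight each by its session count
avg_session_minutes = sum(
    float(data['avg_session_duration'].split()[0]) * data['sessions']
    for data in usage_patterns['user_segments'].values()
) / total_user_sessions

print(f"\n💡 **Usage Insights:**")
print(f"   - Total User Sessions: {total_user_sessions:,}")
print(f"   - Most Popular Query: {top_query['query']}")
print(f"   - Primary Users: {primary_segment.replace('_', ' ').title()}")
print(f"   - Busiest Listed Hour: {busiest_hour['hour']}")
print(f"   - Average Session: {avg_session_minutes:.1f} minutes")

# Recommendations
print(f"\n🎯 **Optimization Recommendations:**")
print(f"   1. **Cache frequent queries** - '{top_query['query']}' accounts for {top_query_share:.0f}% of top-query traffic")
print(f"   2. **Optimize troubleshooting responses** - Highest response times (1.5-2s)")
print(f"   3. **Scale during peak hours** - Consider auto-scaling for 9-11 AM")
print(f"   4. **Improve customer service UX** - Largest user group by session count")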

print_success("Usage analytics completed")

## Step 5: Observability Dashboard

Let's describe the dashboard layout and alert rules as a JSON configuration and save it to disk.

In [None]:
# Create observability dashboard configuration
print_header("Observability Dashboard Configuration", level=2)

dashboard_config = {
    "dashboard_name": "SAP-Sales-Order-Agent-Production",
    "refresh_interval": "1m",
    "time_range": "24h",
    "widgets": [
        {
            "title": "Request Volume",
            "type": "line_chart",
            "metrics": ["RequestsPerMinute", "ActiveSessions"],
            "position": {"x": 0, "y": 0, "width": 12, "height": 6}
        },
        {
            "title": "Response Times",
            "type": "line_chart",
            "metrics": ["AverageResponseTime", "P95ResponseTime", "P99ResponseTime"],
            "position": {"x": 12, "y": 0, "width": 12, "height": 6}
        },
        {
            "title": "Success Rate",
            "type": "gauge",
            "metrics": ["SuccessRate"],
            "thresholds": {"warning": 99.0, "critical": 98.0},
            "position": {"x": 0, "y": 6, "width": 6, "height": 6}
        },
        {
            "title": "Error Rate",
            "type": "gauge",
            "metrics": ["ErrorRate"],
            "thresholds": {"warning": 0.5, "critical": 1.0},
            "position": {"x": 6, "y": 6, "width": 6, "height": 6}
        },
        {
            "title": "SAP Integration Health",
            "type": "status_grid",
            "metrics": ["SAPConnectionStatus", "SAPResponseTime", "SAPErrorRate"],
            "position": {"x": 12, "y": 6, "width": 12, "height": 6}
        },
        {
            "title": "Top Errors",
            "type": "table",
            "metrics": ["ErrorsByType"],
            "position": {"x": 0, "y": 12, "width": 12, "height": 6}
        },
        {
            "title": "Resource Utilization",
            "type": "line_chart",
            "metrics": ["CPUUtilization", "MemoryUtilization", "NetworkIO"],
            "position": {"x": 12, "y": 12, "width": 12, "height": 6}
        }
    ],
    "alerts": [
        {
            "name": "High Response Time",
            "condition": "AverageResponseTime > 3000ms for 5 minutes",
            "severity": "WARNING",
            "actions": ["email", "slack"]
        },
        {
            "name": "Low Success Rate",
            "condition": "SuccessRate < 98% for 3 minutes",
            "severity": "CRITICAL",
            "actions": ["email", "slack", "pagerduty"]
        },
        {
            "name": "SAP Connection Issues",
            "condition": "SAPConnectionErrors > 5 in 10 minutes",
            "severity": "HIGH",
            "actions": ["email", "slack"]
        }
    ]
}

print_info("🎛️ **Observability Dashboard Configuration:**")
print(f"\n📊 **Dashboard Details:**")
print(f"   - Name: {dashboard_config['dashboard_name']}")
print(f"   - Refresh Interval: {dashboard_config['refresh_interval']}")
print(f"   - Time Range: {dashboard_config['time_range']}")
print(f"   - Total Widgets: {len(dashboard_config['widgets'])}")

print(f"\n📈 **Dashboard Widgets:**")
for widget in dashboard_config['widgets']:
    print(f"   - {widget['title']} ({widget['type']})")

print(f"\n🚨 **Configured Alerts:**")
for alert in dashboard_config['alerts']:
    print(f"   - {alert['name']} ({alert['severity']})")
    print(f"     Condition: {alert['condition']}")
    print(f"     Actions: {', '.join(alert['actions'])}")

# Save dashboard configuration
with open('observability_dashboard.json', 'w') as f:
    json.dump(dashboard_config, f, indent=2)

print_success("Observability dashboard configuration created")
print_info("Configuration saved to: observability_dashboard.json")
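
The JSON saved above is a workshop-specific description, not something CloudWatch accepts directly. CloudWatch dashboards are created from a dashboard body whose widgets carry `type`, `x`, `y`, `width`, `height`, and `properties`. The sketch below translates a workshop widget into that shape and shows where boto3's `put_dashboard` would be called. The namespace is a hypothetical placeholder, and every widget is rendered as a plain metric widget regardless of its workshop `type` (so `gauge`, `table`, and `status_grid` are not given special treatment here).

```python
import json

NAMESPACE = "SAPAgent/Workshop"  # hypothetical; match your published metrics


def to_cloudwatch_widget(widget: dict, region: str = "us-east-1") -> dict:
    """Translate one workshop widget into a CloudWatch metric widget."""
    pos = widget["position"]
    return {
        "type": "metric",
        "x": pos["x"],
        "y": pos["y"],
        "width": pos["width"],
        "height": pos["height"],
        "properties": {
            "title": widget["title"],
            "metrics": [[NAMESPACE, name] for name in widget["metrics"]],
            "stat": "Average",
            "period": 60,
            "region": region,
        },
    }


def build_dashboard_body(config: dict, region: str = "us-east-1") -> str:
    """Return the JSON string expected by put_dashboard's DashboardBody."""
    widgets = [to_cloudwatch_widget(w, region) for w in config["widgets"]]
    return json.dumps({"widgets": widgets})


def deploy_dashboard(config: dict, region: str = "us-east-1") -> None:
    """Create or update the dashboard (requires AWS credentials)."""
    import boto3  # imported lazily so the translation works offline

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_dashboard(
        DashboardName=config["dashboard_name"],
        DashboardBody=build_dashboard_body(config, region),
    )


# One widget copied from the configuration above
example_config = {
    "dashboard_name": "SAP-Sales-Order-Agent-Production",
    "widgets": [
        {
            "title": "Response Times",
            "type": "line_chart",
            "metrics": ["AverageResponseTime", "P95ResponseTime", "P99ResponseTime"],
            "position": {"x": 12, "y": 0, "width": 12, "height": 6},
        }
    ],
}
body = json.loads(build_dashboard_body(example_config))
print(body["widgets"][0]["properties"]["metrics"])
```

In the notebook, `deploy_dashboard(dashboard_config)` would push the full layout once the metrics exist under the chosen namespace.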

## Step 6: Troubleshooting Guide

Let's create a comprehensive troubleshooting guide for operations teams.

In [None]:
# Create troubleshooting guide
print_header("Troubleshooting Guide", level=2)

troubleshooting_guide = {
    "common_issues": [
        {
            "issue": "High Response Times",
            "symptoms": ["Response times > 3 seconds", "User complaints about slowness", "Timeout errors"],
            "causes": ["SAP system overload", "Network latency", "Memory pressure", "Database locks"],
            "solutions": [
                "Check SAP system status and performance",
                "Review network connectivity to SAP",
                "Scale up AgentCore Runtime instances",
                "Check for memory leaks or high CPU usage",
                "Review and optimize database queries"
            ]
        },
        {
            "issue": "SAP Connection Failures",
            "symptoms": ["SAP timeout errors", "Authentication failures", "Connection refused errors"],
            "causes": ["SAP system downtime", "Network issues", "Credential expiration", "Firewall changes"],
            "solutions": [
                "Verify SAP system availability",
                "Check network connectivity and firewall rules",
                "Validate SAP credentials in Secrets Manager",
                "Review AgentCore Gateway configuration",
                "Check SAP user permissions and locks"
            ]
        },
        {
            "issue": "Email Delivery Failures",
            "symptoms": ["Email not received", "SNS delivery failures", "Bounce notifications"],
            "causes": ["Invalid email addresses", "SNS topic issues", "Email service limits", "Spam filters"],
            "solutions": [
                "Validate email address format",
                "Check SNS topic configuration and permissions",
                "Review email service quotas and limits",
                "Check spam filters and email reputation",
                "Verify SES configuration if using SES"
            ]
        },
        {
            "issue": "Memory Service Errors",
            "symptoms": ["Conversation history lost", "Session context errors", "Memory timeouts"],
            "causes": ["Memory service downtime", "Storage limits", "Permission issues", "Network connectivity"],
            "solutions": [
                "Check AgentCore Memory service status",
                "Review memory storage quotas and usage",
                "Validate IAM permissions for memory access",
                "Check network connectivity to memory service",
                "Review memory retention policies"
            ]
        }
    ],
    "diagnostic_commands": [
        {
            "purpose": "Check agent health",
            "command": "curl -X GET https://your-api-endpoint/health",
            "expected_output": '{"status": "healthy", "agent_status": "active"}'
        },
        {
            "purpose": "Test SAP connectivity",
            "command": "agentcore invoke '{\"prompt\": \"Show me blocked orders\"}'",
            "expected_output": "List of blocked orders from SAP system"
        },
        {
            "purpose": "Check CloudWatch logs",
            "command": "aws logs describe-log-groups --log-group-name-prefix /aws/bedrock-agentcore",
            "expected_output": "List of log groups for the agent"
        }
    ],
    "escalation_procedures": [
        {
            "level": "L1 - Operations Team",
            "responsibilities": ["Monitor dashboards", "Respond to alerts", "Basic troubleshooting"],
            "escalation_criteria": ["Unable to resolve in 30 minutes", "Critical system impact", "Multiple service failures"]
        },
        {
            "level": "L2 - Engineering Team",
            "responsibilities": ["Advanced troubleshooting", "Code-level debugging", "Configuration changes"],
            "escalation_criteria": ["Requires code changes", "Architecture-level issues", "Unable to resolve in 2 hours"]
        },
        {
            "level": "L3 - Senior Engineering",
            "responsibilities": ["System architecture decisions", "Vendor escalation", "Major incident response"],
            "escalation_criteria": ["System-wide outage", "Data integrity issues", "Security incidents"]
        }
    ]
}

print_info("🔧 **Troubleshooting Guide:**")

print(f"\n🚨 **Common Issues ({len(troubleshooting_guide['common_issues'])}):**")
for issue in troubleshooting_guide['common_issues']:
    print(f"\n   **{issue['issue']}:**")
    print(f"   - Symptoms: {', '.join(issue['symptoms'][:2])}...")
    print(f"   - Primary Causes: {', '.join(issue['causes'][:2])}...")
    print(f"   - Solutions Available: {len(issue['solutions'])}")

print(f"\n🔍 **Diagnostic Commands ({len(troubleshooting_guide['diagnostic_commands'])}):**")
for cmd in troubleshooting_guide['diagnostic_commands']:
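    # Only append an ellipsis when the command was actually truncated
    preview = cmd['command'] if len(cmd['command']) <= 50 else cmd['command'][:50] + "..."
    print(f"   - {cmd['purpose']}: {preview}")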

print(f"\n📞 **Escalation Levels ({len(troubleshooting_guide['escalation_procedures'])}):**")
for level in troubleshooting_guide['escalation_procedures']:
    print(f"   - {level['level']}: {len(level['responsibilities'])} responsibilities")

# Save troubleshooting guide
with open('troubleshooting_guide.json', 'w') as f:
    json.dump(troubleshooting_guide, f, indent=2)

print_success("Troubleshooting guide created")
print_info("Guide saved to: troubleshooting_guide.json")
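
The `aws logs describe-log-groups` diagnostic can also be run from Python, which is convenient inside a notebook. Here is a minimal sketch, assuming boto3's CloudWatch Logs client and its `describe_log_groups` call. The function takes the client as a parameter so it can be exercised with a stub; the stub class and its log group names are made up for illustration.

```python
def list_log_groups(logs_client, prefix: str) -> list:
    """Return all log group names that start with `prefix`, following pagination."""
    names = []
    kwargs = {"logGroupNamePrefix": prefix}
    while True:
        response = logs_client.describe_log_groups(**kwargs)
        names.extend(group["logGroupName"] for group in response.get("logGroups", []))
        token = response.get("nextToken")
        if not token:
            return names
        kwargs["nextToken"] = token


class StubLogsClient:
    """Stand-in for boto3.client('logs') that serves two pages of made-up data."""

    def __init__(self):
        self._pages = {
            None: {
                "logGroups": [{"logGroupName": "/example/agent/runtime-a"}],
                "nextToken": "page-2",
            },
            "page-2": {"logGroups": [{"logGroupName": "/example/agent/runtime-b"}]},
        }

    def describe_log_groups(self, logGroupNamePrefix, nextToken=None):
        return self._pages[nextToken]


groups = list_log_groups(StubLogsClient(), "/example/agent")
print(groups)
```

With real credentials, pass `boto3.client("logs")` and the log group prefix from the diagnostic command instead of the stub.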

## Step 7: Save Lab Progress

Let's save our progress and prepare for the final cleanup lab.

In [None]:
# Save lab progress
print_header("Saving Lab Progress", level=2)

# Mark lab as complete
lab_resources = {
    "observability_features": [
        "cloudwatch_metrics",
        "performance_analytics",
        "error_tracking",
        "usage_analytics",
        "custom_dashboards",
        "alerting_system",
        "troubleshooting_guide"
    ],
    "monitoring_metrics": [
        "request_volume",
        "response_times",
        "success_rates",
        "error_rates",
        "sap_integration_health",
        "resource_utilization",
        "user_analytics"
    ],
    "alert_configurations": [
        "high_response_time",
        "low_success_rate",
        "sap_connection_issues",
        "error_rate_threshold",
        "resource_utilization"
    ],
    "dashboard_files": {
        "dashboard_config": "observability_dashboard.json",
        "troubleshooting_guide": "troubleshooting_guide.json"
    },
    "performance_summary": {
        "avg_response_time_ms": round(avg_response_time),
        "success_rate_percent": round(avg_success_rate, 2),
        "total_requests_24h": total_requests,
        "error_rate_percent": round((total_errors/total_requests)*100, 3)
    }
}

workshop_progress.mark_lab_complete(6, lab_resources)

# Display progress
workshop_progress.display_progress()

print_success("Lab 6 completed successfully!")
print_info("Comprehensive observability is now configured for the production SAP agent")
print_info("Ready for final cleanup in Lab 7")

## 🎉 Lab 6 Complete!

Outstanding work! You've successfully implemented comprehensive observability for your production SAP agent. Here's what you accomplished:

### ✅ What You Built
- Generated sample CloudWatch-style metrics for the agent
- Analyzed 24-hour performance trends
- Categorized errors and defined alert thresholds
- Analyzed usage by query, user segment, and hour
- Defined a dashboard and alert configuration (`observability_dashboard.json`)
- Created a troubleshooting guide with diagnostic commands (`troubleshooting_guide.json`)
- Established escalation procedures

### 🧠 Key Concepts Learned
- **Observability Strategy**: Comprehensive monitoring approach
- **Performance Analytics**: Understanding system behavior
- **Error Tracking**: Proactive issue identification
- **Usage Analytics**: User behavior and optimization insights

### 🔄 Current Architecture
```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Interfaces    │───▶│ AgentCore       │───▶│ SAP Agent       │
│   • Streamlit   │    │ Runtime         │    │ (Production)    │
│   • FastAPI     │    │ • Monitoring    │    │                 │
│   • Direct      │    │ • Scaling       │    │                 │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │
                                ▼
                       ┌─────────────────┐
                       │ Observability   │
                       │ • CloudWatch    │
                       │ • Dashboards    │
                       │ • Alerts        │
                       │ • Analytics     │
                       │ • Troubleshoot  │
                       └─────────────────┘
```

### 🚀 Next Steps
In **Lab 7**, we'll complete the workshop by:
- **Cleaning up resources** created during the workshop
- **Reviewing what we built** and key learnings
- **Planning next steps** for production deployment
- **Providing additional resources** for continued learning

### 💡 Key Takeaways
1. **Observability is critical** for production AI agents
2. **Proactive monitoring** prevents issues before they impact users
3. **Usage analytics** drive optimization and feature decisions
4. **Comprehensive dashboards** enable quick issue identification
5. **Troubleshooting guides** reduce mean time to resolution

### 🔍 Observability Benefits Achieved
- **Visibility**: Complete insight into agent performance
- **Proactive Monitoring**: Issues detected before user impact
- **Performance Optimization**: Data-driven improvement decisions
- **Operational Excellence**: Streamlined troubleshooting procedures
- **User Experience**: Understanding and improving user interactions

Your SAP agent now has enterprise-grade observability and monitoring!

Ready to wrap up the workshop? **[Continue to Lab 7 →](lab-07-cleanup.ipynb)**