A comprehensive repository providing production-ready infrastructure, monitoring, and observability solutions for AI/ML systems at scale.
✨ NEW: Phase 3 - Unified Observability with AI-Powered Search & Multi-Agent Investigation
This repository provides complete infrastructure-as-code, deployment manifests, and observability integrations for:
- Cloud Infrastructure: Production-ready Terraform modules for AWS, Azure, and GCP
- Kubernetes Deployments: Complete manifests and Helm charts for AI services
- AI Model Observability: Comprehensive monitoring and tracing for AI/ML models
- Multi-Cloud Integration: AWS CloudWatch & X-Ray, Azure Monitor & Application Insights, GCP Cloud Monitoring, Datadog
- Advanced Patterns: Caching, security, rate limiting, and PII detection
- Platform Integrations: Ready-to-deploy configurations for popular monitoring tools
- Unified Correlation: A single correlation ID links all telemetry (traces, logs, metrics)
- MCP Observability Server: AI agents query observability data via the Model Context Protocol
- Multi-Agent Investigation: Autonomous incident investigation with an 80% reduction in MTTR
ai_dev_ops/
├── terraform/                   # Infrastructure as Code
│   ├── aws/                     # AWS EKS, VPC, IAM, CloudWatch
│   ├── azure/                   # Azure AKS, VNet, Application Insights
│   └── gcp/                     # GCP GKE, VPC, IAM, Cloud Monitoring
├── kubernetes/                  # Kubernetes manifests
│   ├── base/                    # Base Kustomize resources
│   └── overlays/                # Environment-specific configurations
├── helm/                        # Helm charts
│   └── ai-inference-service/    # Production-ready AI service chart
├── examples/                    # Code samples and integrations
│   ├── opentelemetry/           # OpenTelemetry instrumentation
│   ├── azure/                   # Azure Monitor examples
│   ├── prometheus/              # Prometheus metrics
│   ├── aws/                     # AWS CloudWatch & X-Ray integration
│   ├── gcp/                     # GCP Cloud Monitoring & Trace
│   ├── datadog/                 # Datadog APM full integration
│   ├── caching/                 # Redis caching patterns
│   ├── security/                # Security best practices
│   ├── unified-correlation/     # Correlation framework
│   ├── multi-agent/             # Multi-agent investigation system
│   └── scenarios/               # End-to-end examples
├── mcp-server/                  # MCP Observability Server
│   └── tools/                   # MCP tools for AI agents
├── integrations/                # Platform configurations
│   ├── grafana/                 # Grafana dashboards and alerts
│   ├── datadog/                 # Datadog integration configs
│   ├── azure-monitor/           # Azure Monitor configurations
│   ├── elastic-stack/           # Elasticsearch, Logstash, Kibana
│   ├── splunk/                  # Splunk integration
│   └── newrelic/                # New Relic APM
├── data-formats/                # Schema definitions
│   ├── metrics/                 # Metrics format specifications
│   ├── logs/                    # Structured logging formats
│   ├── traces/                  # Distributed tracing formats
│   └── unified/                 # Unified correlation schemas
└── docs/                        # Documentation and best practices
- Python 3.8+
- Docker (for containerized examples)
- Access to a monitoring platform (Grafana, Azure Monitor, Datadog, etc.)
- Clone the repository:
git clone https://github.com/ianlintner/ai_dev_ops.git
cd ai_dev_ops
- Install dependencies:
pip install -r requirements.txt
- Deploy infrastructure (choose your cloud):
# AWS
cd terraform/aws
terraform init
terraform apply
# Azure
cd terraform/azure
terraform init
terraform apply
# GCP
cd terraform/gcp
terraform init
terraform apply
- Deploy AI services:
# Using Kubernetes manifests
kubectl apply -k kubernetes/overlays/prod/
# Or using Helm
helm install ai-inference helm/ai-inference-service \
  --namespace ai-services --create-namespace
- Explore the examples:
cd examples/opentelemetry
python basic_instrumentation.py
Monitor AI agents and workflows with distributed tracing:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

# Register a tracer provider so spans are actually recorded.
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("ai_inference"):
    # Your AI model inference code here
    result = model.predict(data)
See examples/opentelemetry for complete examples.
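For trace data to reach a backend, the tracer provider also needs an exporter. Below is a minimal sketch assuming the `opentelemetry-exporter-otlp-proto-grpc` package and an OpenTelemetry Collector listening on `localhost:4317`; the `ai.*` attribute names are illustrative choices, not a convention defined by this repository.

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Register a provider that batches spans and ships them to a local collector.
provider = TracerProvider(resource=Resource.create({"service.name": "ai-inference"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("ai_inference") as span:
    # Attach AI-specific attributes so traces can answer cost and latency questions.
    span.set_attribute("ai.model", "gpt-4")
    span.set_attribute("ai.prompt_tokens", 120)
    span.set_attribute("ai.completion_tokens", 30)
    # result = model.predict(data)
```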
Reduce costs with intelligent caching:
from caching import CachedAIService
service = CachedAIService(cache_ttl=3600)
result = service.inference_with_cache(prompt, model='gpt-4')
print(f"Cache hit rate: {service.get_cache_stats()['hit_rate_percent']}%")See examples/caching for complete examples.
PII detection and rate limiting:
from security import SecureAIService
service = SecureAIService()
result = service.secure_inference(api_key, user_input, model='gpt-4')
# Automatically detects and masks PII, enforces rate limits
See examples/security for complete examples.
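As a rough sketch of the two mechanisms such a service combines, here is pattern-based PII masking plus a token-bucket rate limiter; the patterns and limits are illustrative, and this is not the repository's `SecureAIService` implementation.

```python
import re
import time

# Illustrative patterns only; production PII detection needs broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def mask_pii(text):
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}_REDACTED]", text)
    return text


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate=5.0, capacity=10):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.updated = capacity, time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket()
if bucket.allow():
    safe_input = mask_pii("Contact me at jane@example.com")
    # result = model.predict(safe_input)
```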
Automatically correlate traces, logs, and metrics:
from correlation_framework import setup_correlation, CorrelatedLogger
# Setup correlation
manager = setup_correlation(service_name="payment-service")
logger = CorrelatedLogger("payment", manager)
# Create correlation context
context = manager.create_context(request_id="req_123", user_id="user_789")
# All telemetry automatically correlated
logger.info("Processing payment", extra={"amount": 99.99})
# Logs, traces, and metrics all linked by correlation ID
See examples/unified-correlation for complete examples.
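Under the hood, this style of correlation usually amounts to a context variable plus a logging filter. A minimal sketch of that mechanism, independent of the repository's `correlation_framework` (the `X-Correlation-ID` header name is a common but not mandated choice):

```python
import contextvars
import logging
import uuid

# Holds the correlation ID for the current request/task.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s %(correlation_id)s %(message)s"))
handler.addFilter(CorrelationFilter())
logging.basicConfig(level=logging.INFO, handlers=[handler])


def handle_request(headers):
    # Reuse an incoming ID so downstream services share the same one.
    correlation_id.set(headers.get("X-Correlation-ID", uuid.uuid4().hex))
    logging.info("Processing payment")  # automatically tagged with the ID


handle_request({"X-Correlation-ID": "c1a2b3d4e5f6789012345678901234ab"})
```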
Multi-agent system for autonomous incident investigation:
from investigation_system import (
InvestigationContext,
TriageAgent,
RootCauseAgent,
RemediationAgent,
)
# Create investigation context
context = InvestigationContext(
incident_id="INC-001",
symptoms=["error_rate_spike", "high_latency"],
)
# Run multi-agent investigation
triage = TriageAgent()
findings = await triage.investigate(context)
# Results in <2 minutes:
# - Severity classification (0.85 confidence)
# - Root cause identification (0.88 confidence)
# - Remediation actions with runbooks
# - 80% faster than manual investigation
See examples/multi-agent for complete examples.
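A minimal orchestration sketch that chains the agents end to end, assuming each agent exposes the same async `investigate(context)` interface shown above for `TriageAgent`:

```python
import asyncio

from investigation_system import (
    InvestigationContext,
    TriageAgent,
    RootCauseAgent,
    RemediationAgent,
)


async def investigate_incident(incident_id, symptoms):
    context = InvestigationContext(incident_id=incident_id, symptoms=symptoms)
    findings = {}
    # Run the agents in sequence so each can build on earlier findings.
    for agent in (TriageAgent(), RootCauseAgent(), RemediationAgent()):
        findings[type(agent).__name__] = await agent.investigate(context)
    return findings


if __name__ == "__main__":
    report = asyncio.run(
        investigate_incident("INC-001", ["error_rate_spike", "high_latency"])
    )
    print(report)
```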
Natural language queries for observability data:
from mcp_client import MCPClient
mcp = MCPClient(endpoint="http://localhost:8000")
# Natural language search
result = mcp.call_tool(
"search_logs",
query="database connection timeout",
service_name="auth-service",
time_range="last_hour",
)
# Cross-telemetry correlation
result = mcp.call_tool(
"correlate_events",
correlation_id="c1a2b3d4e5f6789012345678901234ab",
include_types=["traces", "logs", "metrics"],
)
# AI-powered root cause analysis
result = mcp.call_tool(
"analyze_incident",
affected_services=["payment-service", "auth-service"],
symptoms=["high_latency", "error_rate_spike"],
)
See mcp-server for complete documentation.
Collect and export metrics in Prometheus format:
from prometheus_client import Counter, Histogram
inference_counter = Counter('ai_inference_total', 'Total AI inferences')
inference_latency = Histogram('ai_inference_latency_seconds', 'Inference latency')
See examples/prometheus for complete examples.
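A usage sketch for these metrics: label them per model, time each inference, and expose a local `/metrics` endpoint for Prometheus to scrape (the port and label names are illustrative):

```python
import time

from prometheus_client import Counter, Histogram, start_http_server

inference_counter = Counter("ai_inference_total", "Total AI inferences", ["model"])
inference_latency = Histogram("ai_inference_latency_seconds", "Inference latency", ["model"])


def run_inference(model_name, data):
    inference_counter.labels(model=model_name).inc()
    with inference_latency.labels(model=model_name).time():
        time.sleep(0.1)  # stand-in for the real model call
        return {"ok": True}


if __name__ == "__main__":
    start_http_server(8001)  # serves http://localhost:8001/metrics
    run_inference("gpt-4", {"prompt": "hello"})
    time.sleep(60)  # keep the process alive so the endpoint can be scraped
```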
Complete production-ready infrastructure:
- EKS Cluster: Managed Kubernetes with auto-scaling
- VPC: Multi-AZ networking with NAT gateways
- IAM: IRSA roles for workload identity
- CloudWatch: Metrics, logs, and dashboards
- X-Ray: Distributed tracing
Deploy with Terraform: terraform/aws
Complete Azure deployment:
- AKS Cluster: Managed Kubernetes with system and AI workload node pools
- VNet: Virtual network with network security groups
- Application Insights: Application performance monitoring
- Log Analytics: Centralized logging and analytics
- Container Insights: Container and cluster monitoring
- Azure Monitor: Custom metrics and alerts
Deploy with Terraform: terraform/azure
Full GCP deployment:
- GKE Cluster: Regional cluster with Workload Identity
- VPC: Private cluster with Cloud NAT
- IAM: Service accounts with least privilege
- Cloud Monitoring: Custom metrics and alerts
- Cloud Trace: Performance monitoring
Deploy with Terraform: terraform/gcp
Production-ready manifests:
- Base configurations with Kustomize
- Environment-specific overlays (dev, prod)
- HPA for auto-scaling
- PodDisruptionBudget for availability
- Security contexts and policies
See kubernetes/ for manifests.
Simplified deployment:
- Configurable replica count
- Built-in autoscaling
- Ingress with TLS
- Resource management
- Observability enabled
Deploy with Helm: helm/ai-inference-service
Full APM integration:
- Distributed tracing with ddtrace
- Custom metrics for AI workloads
- Log management with trace correlation
- Pre-built dashboards and monitors
See examples/datadog
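A minimal `ddtrace` sketch, assuming a Datadog Agent is reachable with its default local configuration; the service, resource, and tag names are illustrative:

```python
from ddtrace import tracer

# Creates an APM span around the inference call; tags become searchable facets.
with tracer.trace("ai.inference", service="ai-inference", resource="gpt-4") as span:
    span.set_tag("ai.model", "gpt-4")
    span.set_tag("ai.tokens_used", 150)
    # result = model.predict(data)
```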
Native AWS observability:
- CloudWatch Logs with structured logging
- Custom metrics for AI KPIs
- X-Ray distributed tracing
- Lambda function templates
See examples/aws
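A sketch of publishing a custom CloudWatch metric with `boto3`; the namespace, metric name, and dimensions are illustrative choices:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# One latency datapoint for a single inference, dimensioned by model.
cloudwatch.put_metric_data(
    Namespace="AI/Inference",
    MetricData=[
        {
            "MetricName": "InferenceLatency",
            "Dimensions": [{"Name": "Model", "Value": "gpt-4"}],
            "Value": 0.542,
            "Unit": "Seconds",
        }
    ],
)
```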
Native GCP observability:
- Cloud Monitoring custom metrics
- Cloud Trace integration
- Structured logging with Cloud Logging
- Performance dashboards
See examples/gcp
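A sketch of routing structured logs to Cloud Logging through the standard `logging` module, assuming the `google-cloud-logging` package and application default credentials are available:

```python
import logging

import google.cloud.logging

client = google.cloud.logging.Client()
client.setup_logging()  # attach a Cloud Logging handler to the root logger

# json_fields become structured payload fields in Cloud Logging.
logging.info(
    "Inference completed",
    extra={"json_fields": {"model": "gpt-4", "latency_ms": 542, "tokens_used": 150}},
)
```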
Pre-built dashboards:
- Model performance metrics
- Inference latency and throughput
- Error rates and anomaly detection
- Cost tracking
Azure AI Foundry observability:
- Application Insights integration
- Log Analytics workspace
- Custom metrics and alerts
See integrations/azure-monitor
# HELP ai_inference_latency_seconds Time taken for AI inference
# TYPE ai_inference_latency_seconds histogram
ai_inference_latency_seconds_bucket{model="gpt-4",environment="production",le="1.0"} 42
ai_inference_latency_seconds_bucket{model="gpt-4",environment="production",le="+Inf"} 45
ai_inference_latency_seconds_sum{model="gpt-4",environment="production"} 24.4
ai_inference_latency_seconds_count{model="gpt-4",environment="production"} 45
{
"timestamp": "2025-11-13T22:00:00Z",
"level": "INFO",
"message": "Inference completed",
"model": "gpt-4",
"latency_ms": 542,
"tokens_used": 150
}
{
"trace_id": "abcd1234ef567890abcd1234ef567890",
"span_id": "bcde9101f2345678",
"operation": "model_inference",
"parent_span_id": "cdef1121a2345678",
"duration_ms": 542
}
See data-formats for complete schema definitions.
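A sketch of a `logging` formatter that emits the structured log format above; the field names follow the example, and everything else is an illustrative choice:

```python
import json
import logging
from datetime import datetime, timezone


class JSONFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Carry any extra fields (model, latency_ms, tokens_used, ...) through.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.info(
    "Inference completed",
    extra={"fields": {"model": "gpt-4", "latency_ms": 542, "tokens_used": 150}},
)
```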
- Use Infrastructure as Code: Terraform for reproducible deployments
- Multi-AZ Deployment: Ensure high availability across availability zones
- Auto-scaling: Configure HPA based on CPU, memory, and custom metrics
- Resource Limits: Always define requests and limits for containers
- Security Contexts: Run containers as non-root with minimal privileges
- Instrument Early: Add observability from the start of development
- Use Standard Formats: Leverage OpenTelemetry and Prometheus standards
- Monitor Costs: Track token usage and API costs religiously
- Detect Drift: Monitor model performance degradation over time
- Automate Alerts: Set up intelligent alerting for anomalies
- Trace Context: Always correlate logs with traces using trace IDs
- API Key Management: Never commit secrets, use secrets managers
- Rate Limiting: Implement token bucket rate limiting
- PII Detection: Automatically detect and mask sensitive data
- Input Validation: Sanitize all user inputs
- Audit Logging: Log all security events for compliance
- Caching: Use Redis for prompt and response caching
- Batching: Batch requests when possible to reduce latency
- Connection Pooling: Reuse connections to AI services
- Model Selection: Choose appropriate models based on requirements
See docs/best-practices.md for detailed guidelines.
- ✅ Single Correlation ID links all telemetry (traces, logs, metrics, events)
- ✅ Automatic Propagation across services via HTTP headers
- ✅ Privacy-Preserving user ID hashing
- ✅ Zero Overhead correlation context management
- ✅ MCP Observability Server with 5 specialized tools
- ✅ Natural Language Queries for logs, traces, and metrics
- ✅ Semantic Search with vector embeddings
- ✅ Sub-Second Performance (<500ms P95)
- ✅ 4 Specialized Agents: Triage, Correlation, Root Cause, Remediation
- ✅ 2-Minute Investigations (vs. 45-90 minutes manual)
- ✅ 80% MTTR Reduction demonstrated
- ✅ 85%+ Accuracy in root cause identification
- ✅ Autonomous Operation with confidence scores
- ✅ 89% Faster Resolution (85 min → 9 min in example)
- ✅ 5-50x ROI ($10K-100K/month savings)
- ✅ 100% Automation of correlation and investigation
- ✅ Complete Documentation with automatic incident reports
Learn More:
- Phase 3 Plan - Complete vision and architecture
- Phase 3 Complete - Implementation summary
- Payment Failure Scenario - End-to-end example
- Multi-Agent System - Agent documentation
- MCP Server - API documentation
- Terraform: Infrastructure as Code for AWS, Azure, and GCP
- Kubernetes: Container orchestration with EKS, AKS, and GKE
- Helm: Package manager for Kubernetes applications
- Kustomize: Configuration management for Kubernetes
- OpenTelemetry: Vendor-neutral distributed tracing and metrics
- Prometheus: Time-series metrics collection
- Grafana: Visualization and dashboards
- Datadog: Full-stack APM and monitoring
- AWS CloudWatch & X-Ray: Native AWS observability
- GCP Cloud Monitoring & Trace: Native GCP observability
- Azure Monitor: Cloud-native Azure monitoring
- Python 3.8+: Primary language for examples
- boto3: AWS SDK
- google-cloud: GCP SDK
- ddtrace: Datadog tracing
- redis-py: Redis client
- Redis: Caching and session management
- Rate Limiting: Token bucket algorithm
- PII Detection: Pattern-based and ML-based detection
This repository includes GitHub Copilot instructions in .github/copilot-instructions.md to help with:
- Code style and patterns
- AI-specific observability conventions
- Integration best practices
- Documentation standards
Contributions are welcome! Please see CONTRIBUTING.md for detailed guidelines.
# Install development dependencies
pip install -r requirements-dev.txt
# Install pre-commit hooks
pip install pre-commit
pre-commit install
# Run formatting
make format
# Run linting
make lint
# Run tests
make test
# Validate JSON schemas
make validate
# Run all checks
make all
This repository uses GitHub Actions for:
- Linting: Code quality checks with flake8, pylint, black, and isort
- Testing: Validation across Python 3.8, 3.9, 3.10, and 3.11
- Security: Bandit and Safety scans
- Documentation: Markdown link checking
MIT License - See LICENSE for details
- OpenTelemetry Documentation
- Prometheus Best Practices
- Azure AI Foundry Observability
- Grafana Dashboards
For questions or suggestions, please open an issue in this repository.