# Day 6: Production Tools & Monitoring

## 🎯 Learning Objectives
By the end of this session, you will:
- Integrate LangSmith for comprehensive observability and monitoring
- Use LangGraph Studio for visual debugging and development
- Implement PostgreSQL checkpointing for production scalability
- Build human-in-the-loop workflows for critical decisions
- Monitor OpenAI API costs and optimize performance
- Set up alerts and error tracking for production systems

## ⏱️ Session Structure (2 hours)
- **Learning Materials** (30 min): Production tools and monitoring theory
- **Hands-on Code** (60 min): Implementation with real monitoring
- **Practical Exercises** (30 min): Build production-ready systems

---

## 📖 Learning Materials (30 minutes)

### 📺 Production Resources
- [LangSmith Observability Guide](https://docs.smith.langchain.com/) - Complete monitoring platform
- [LangGraph Studio Documentation](https://langchain-ai.github.io/langgraph/concepts/langgraph_studio/) - Visual development environment
- [Production Deployment Best Practices](https://langchain-ai.github.io/langgraph/how-tos/deployment/) - Enterprise patterns

### 🏭 Theory: Production Monitoring

#### Why Production Monitoring Matters
Production AI systems require comprehensive observability to:
- **Track Performance**: Monitor response times, success rates, and quality metrics
- **Control Costs**: Track OpenAI API usage and optimize spending
- **Debug Issues**: Trace complex multi-agent interactions and failures
- **Ensure Quality**: Monitor output quality and catch regressions
- **Scale Safely**: Identify bottlenecks and capacity limits

#### Key Monitoring Components
1. **LangSmith**: End-to-end tracing and evaluation platform
2. **LangGraph Studio**: Visual debugging and development environment
3. **Cost Tracking**: OpenAI token usage and billing monitoring
4. **Error Tracking**: Exception handling and alert systems
5. **Performance Metrics**: Latency, throughput, and quality scores
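
Cost tracking (item 3) is mostly arithmetic over token counts. The following is a minimal sketch, assuming a hypothetical `PRICING_PER_1K` table and `estimate_cost` helper. The rates are illustrative placeholders, so check current provider pricing before relying on them:

```python
from typing import Dict, Tuple

# Hypothetical price table: USD per 1K tokens as (prompt, completion).
# The rates are illustrative placeholders, not current provider pricing.
PRICING_PER_1K: Dict[str, Tuple[float, float]] = {
    "gpt-3.5-turbo": (0.0015, 0.002),
}


def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the USD cost of one LLM call from its token counts."""
    if model not in PRICING_PER_1K:
        raise KeyError(f"No pricing configured for model {model!r}")
    prompt_rate, completion_rate = PRICING_PER_1K[model]
    return (prompt_tokens * prompt_rate + completion_tokens * completion_rate) / 1000


cost = estimate_cost("gpt-3.5-turbo", prompt_tokens=1200, completion_tokens=300)
print(f"Estimated cost: ${cost:.6f}")
```

Keeping the rates in one table means a pricing change touches a single place, and an unknown model fails loudly instead of being silently billed at the wrong rate.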

#### Human-in-the-Loop Patterns
- **Approval Workflows**: Critical decisions require human confirmation
- **Quality Control**: Human review of AI outputs before execution
- **Escalation**: Automatic handoff to humans for complex cases
- **Feedback Loops**: Human feedback improves AI performance over time
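
One way to combine the approval and escalation patterns is a small routing function that decides, per request, how much human involvement is needed. This is a sketch only: the keyword list, the confidence threshold, and the `route_request` helper are hypothetical, and a real system would likely use a classifier or policy engine rather than keyword matching:

```python
from typing import Literal

Route = Literal["auto", "needs_approval", "escalate"]

# Hypothetical policy values chosen for illustration.
SENSITIVE_KEYWORDS = ("delete", "critical")
MIN_CONFIDENCE = 0.6


def route_request(text: str, confidence: float) -> Route:
    """Decide how much human involvement a request needs."""
    lowered = text.lower()
    if any(keyword in lowered for keyword in SENSITIVE_KEYWORDS):
        return "needs_approval"  # approval workflow: a human confirms first
    if confidence < MIN_CONFIDENCE:
        return "escalate"  # escalation: hand off when the agent is unsure
    return "auto"  # safe to proceed without review


for text, confidence in [
    ("What is LangGraph?", 0.9),
    ("Please delete all user data", 0.9),
    ("Interpret this ambiguous contract clause", 0.3),
]:
    print(f"{route_request(text, confidence):>14}  <- {text}")
```

Checking sensitivity before confidence means a risky request always reaches a human, even when the agent is confident.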

---
## 💻 Hands-on Code (60 minutes)

### Setup and Dependencies

In [None]:
# Install production monitoring tools
!pip install langsmith langgraph langchain langchain-openai python-dotenv
!pip install langgraph-checkpoint-postgres "psycopg[binary,pool]"
!pip install prometheus-client structlog
!pip install streamlit plotly  # For monitoring dashboards

In [None]:
import os
import time
import json
import logging
from typing import Dict, List, Optional, Any, Literal
from datetime import datetime, timedelta
from pydantic import BaseModel, Field
from dotenv import load_dotenv

# LangSmith imports
from langsmith import Client, trace, traceable
from langsmith.schemas import Run, Example

# LangGraph imports
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.checkpoint.memory import InMemorySaver
from langgraph.types import Command

# LangChain imports
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, AIMessage, BaseMessage
from langchain_core.callbacks import BaseCallbackHandler
from langchain_core.outputs import LLMResult

# Monitoring imports
import structlog
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Load environment variables
load_dotenv()

# Configure structured logging
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.stdlib.PositionalArgumentsFormatter(),
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer()
    ],
    wrapper_class=structlog.stdlib.BoundLogger,
    logger_factory=structlog.stdlib.LoggerFactory(),
    cache_logger_on_first_use=True,
)

logger = structlog.get_logger()

print("✅ Production monitoring tools loaded successfully")

### 1. LangSmith Integration for Observability

In [None]:
# Configure LangSmith: enable tracing only when an API key is available
langsmith_api_key = os.getenv("LANGSMITH_API_KEY")
if langsmith_api_key:
    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGCHAIN_PROJECT"] = "langgraph-course-day6"
    langsmith_client = Client()
    print("✅ LangSmith tracing enabled")
else:
    # Without a key, traces cannot be uploaded, so turn tracing off
    os.environ["LANGCHAIN_TRACING_V2"] = "false"
    print("ℹ️ LangSmith API key not found - tracing disabled")
    langsmith_client = None

# Production monitoring state
class ProductionAgentState(BaseModel):
    """State with comprehensive monitoring"""
    messages: List[BaseMessage] = Field(default_factory=list)
    user_id: str
    session_id: str
    request_id: str
    
    # Monitoring fields
    start_time: float = Field(default_factory=time.time)
    token_usage: Dict[str, int] = Field(default_factory=dict)
    cost_estimate: float = 0.0
    performance_metrics: Dict[str, Any] = Field(default_factory=dict)
    quality_score: Optional[float] = None
    
    # Human-in-the-loop
    requires_approval: bool = False
    approval_status: Literal["pending", "approved", "rejected"] = "pending"
    human_feedback: Optional[str] = None
    
    # Error tracking
    errors: List[Dict[str, Any]] = Field(default_factory=list)
    warnings: List[str] = Field(default_factory=list)

# Custom callback for monitoring
class ProductionMonitoringCallback(BaseCallbackHandler):
    """Callback to track OpenAI usage and costs"""
    
    def __init__(self, state: ProductionAgentState):
        self.state = state
        self.start_time = time.time()
    
    def on_llm_end(self, response: LLMResult, **kwargs) -> None:
        """Track token usage and costs"""
        if hasattr(response, 'llm_output') and response.llm_output:
            token_usage = response.llm_output.get('token_usage') or {}
            
            # Update token counts; skip nested detail dicts that newer API responses may include
            for key, value in token_usage.items():
                if isinstance(value, int):
                    self.state.token_usage[key] = self.state.token_usage.get(key, 0) + value
            
            # Estimate cost from token counts
            prompt_tokens = token_usage.get('prompt_tokens', 0)
            completion_tokens = token_usage.get('completion_tokens', 0)
            
            # Illustrative gpt-3.5-turbo rates in USD per 1K tokens; check current pricing before relying on them
            cost = (prompt_tokens * 0.0015 + completion_tokens * 0.002) / 1000
            self.state.cost_estimate += cost
            
            logger.info(
                "llm_usage",
                request_id=self.state.request_id,
                prompt_tokens=prompt_tokens,
                completion_tokens=completion_tokens,
                cost_estimate=cost
            )

print("📊 Production monitoring components configured")

### 2. Prometheus Metrics for Real-time Monitoring

In [None]:
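# Define Prometheus metrics
# Note: a user_id label creates one time series per user; for large user bases, label by tier or tenant instead
REQUEST_COUNT = Counter('langgraph_requests_total', 'Total requests', ['user_id', 'status'])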
REQUEST_DURATION = Histogram('langgraph_request_duration_seconds', 'Request duration')
TOKEN_USAGE = Counter('openai_tokens_total', 'OpenAI tokens used', ['type', 'model'])
COST_GAUGE = Gauge('openai_cost_dollars', 'OpenAI cost in dollars')
ACTIVE_SESSIONS = Gauge('langgraph_active_sessions', 'Number of active sessions')
ERROR_COUNT = Counter('langgraph_errors_total', 'Total errors', ['error_type'])

class MetricsCollector:
    """Centralized metrics collection"""
    
    def __init__(self):
        self.active_sessions = set()
        self.total_cost = 0.0
    
    def start_request(self, user_id: str, session_id: str):
        """Track request start"""
        self.active_sessions.add(session_id)
        ACTIVE_SESSIONS.set(len(self.active_sessions))
        return time.time()
    
    def end_request(self, user_id: str, session_id: str, start_time: float, status: str, cost: float = 0.0):
        """Track request completion"""
        duration = time.time() - start_time
        
        REQUEST_COUNT.labels(user_id=user_id, status=status).inc()
        REQUEST_DURATION.observe(duration)
        
        if cost > 0:
            self.total_cost += cost
            COST_GAUGE.set(self.total_cost)
        
        self.active_sessions.discard(session_id)
        ACTIVE_SESSIONS.set(len(self.active_sessions))
        
        logger.info(
            "request_completed",
            user_id=user_id,
            session_id=session_id,
            duration=duration,
            status=status,
            cost=cost
        )
    
    def record_tokens(self, model: str, prompt_tokens: int, completion_tokens: int):
        """Record token usage"""
        TOKEN_USAGE.labels(type='prompt', model=model).inc(prompt_tokens)
        TOKEN_USAGE.labels(type='completion', model=model).inc(completion_tokens)
    
    def record_error(self, error_type: str, error_details: str):
        """Record error occurrence"""
        ERROR_COUNT.labels(error_type=error_type).inc()
        logger.error("agent_error", error_type=error_type, details=error_details)

# Initialize metrics collector
metrics = MetricsCollector()

# Start Prometheus metrics server (optional)
try:
    start_http_server(8000)
    print("📈 Prometheus metrics server started on port 8000")
except OSError:
    print("ℹ️ Prometheus metrics server already running or port unavailable")

print("📊 Metrics collection system ready")

### 3. Production-Ready Agent with Monitoring

In [None]:
# Initialize OpenAI with monitoring
openai_api_key = os.getenv("OPENAI_API_KEY")
if not openai_api_key:
    print("⚠️ Please set OPENAI_API_KEY in your .env file")
    
llm = ChatOpenAI(
    model="gpt-3.5-turbo",
    temperature=0,
    openai_api_key=openai_api_key,
    max_retries=3,
    timeout=30
)

@traceable(name="intelligent_processor")
def intelligent_processor(state: ProductionAgentState) -> ProductionAgentState:
    """Process with comprehensive monitoring"""
    
    try:
        # Set up monitoring callback
        callback = ProductionMonitoringCallback(state)
        
        # Process the message
        if state.messages:
            last_message = state.messages[-1]
            
            # Check if this requires human approval (skipped once a human has approved)
            content = last_message.content.lower()
            needs_review = "delete" in content or "critical" in content
            if needs_review and state.approval_status != "approved":
                state.requires_approval = True
                state.warnings.append("Request requires human approval due to sensitive content")
                logger.warning(
                    "approval_required",
                    request_id=state.request_id,
                    reason="sensitive_content"
                )
                return state
            
            # Generate response with monitoring
            start_time = time.time()
            
            response = llm.invoke(
                [last_message],
                callbacks=[callback]
            )
            
            processing_time = time.time() - start_time
            state.performance_metrics["processing_time"] = processing_time
            
            # Quality scoring (simple example)
            response_length = len(response.content)
            state.quality_score = min(1.0, response_length / 100)  # Simple metric
            
            state.messages.append(response)
            
            # Record metrics
            if state.token_usage:
                metrics.record_tokens(
                    "gpt-3.5-turbo",
                    state.token_usage.get("prompt_tokens", 0),
                    state.token_usage.get("completion_tokens", 0)
                )
            
            logger.info(
                "processing_completed",
                request_id=state.request_id,
                processing_time=processing_time,
                quality_score=state.quality_score,
                cost_estimate=state.cost_estimate
            )
    
    except Exception as e:
        error_details = str(e)
        state.errors.append({
            "type": type(e).__name__,
            "message": error_details,
            "timestamp": datetime.now().isoformat()
        })
        
        metrics.record_error(type(e).__name__, error_details)
        
        # Add fallback response
        state.messages.append(AIMessage(
            content="I apologize, but I encountered an error processing your request. Please try again."
        ))
    
    return state

@traceable(name="human_approval_check")
def human_approval_node(state: ProductionAgentState) -> ProductionAgentState:
    """Handle human-in-the-loop approval"""
    
    if state.requires_approval and state.approval_status == "pending":
        # In a real system, this would trigger a notification to humans
        # For demo, we'll simulate approval
        print(f"🔔 Human approval required for request {state.request_id}")
        print(f"📝 Last message: {state.messages[-1].content if state.messages else 'None'}")
        
        # Simulate approval (in real system, this would be async)
        state.approval_status = "approved"  # or "rejected"
        state.human_feedback = "Approved for processing"
        
        logger.info(
            "human_approval",
            request_id=state.request_id,
            status=state.approval_status,
            feedback=state.human_feedback
        )
    
    return state

def should_process(state: ProductionAgentState) -> Literal["process", "approve", "end"]:
    """Routing logic with approval checks"""
    if state.requires_approval:
        if state.approval_status == "pending":
            return "approve"
        if state.approval_status == "rejected":
            return "end"
    
    # Approved or never gated: process only while a human message is still unanswered.
    # An empty message list ends the run instead of looping back into "process".
    if state.messages and isinstance(state.messages[-1], HumanMessage):
        return "process"
    return "end"

print("🏭 Production agent nodes configured")

### 4. PostgreSQL Checkpointing for Production

In [None]:
def create_production_graph():
    """Create a production-ready graph with PostgreSQL persistence"""
    
    # Create the graph
    graph = StateGraph(ProductionAgentState)
    
    # Add nodes
    graph.add_node("process", intelligent_processor)
    graph.add_node("approve", human_approval_node)
    
    # Add edges
    graph.add_edge(START, "process")
    
    graph.add_conditional_edges(
        "process",
        should_process,
        {
            "process": "process",
            "approve": "approve",
            "end": END
        }
    )
    
    graph.add_conditional_edges(
        "approve",
        should_process,
        {
            "process": "process",
            "end": END
        }
    )
    
    # Try PostgreSQL, fallback to in-memory
    postgres_url = os.getenv("POSTGRES_URL")
    
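    if postgres_url:
        try:
            # from_conn_string() is a context manager, so for a long-lived app
            # open a psycopg (v3) connection and pass it to the saver directly
            from psycopg import Connection
            from psycopg.rows import dict_row
            conn = Connection.connect(postgres_url, autocommit=True, prepare_threshold=0, row_factory=dict_row)
            checkpointer = PostgresSaver(conn)
            checkpointer.setup()  # Creates the checkpoint tables on first run
            print("✅ Using PostgreSQL checkpointing for production")
        except Exception as e:
            print(f"⚠️ PostgreSQL unavailable: {e}")
            print("📝 Falling back to in-memory checkpointing")
            checkpointer = InMemorySaver()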
    else:
        print("📝 Using in-memory checkpointing (set POSTGRES_URL for production)")
        checkpointer = InMemorySaver()
    
    # Compile with checkpointing
    app = graph.compile(checkpointer=checkpointer)
    
    return app

# Create production graph
production_app = create_production_graph()
print("🏭 Production graph created with monitoring and checkpointing")

### 5. Production Monitoring Dashboard

In [None]:
import uuid
import random

class ProductionMonitor:
    """Production monitoring and alerting system"""
    
    def __init__(self):
        self.alert_thresholds = {
            "error_rate": 0.05,  # 5% error rate
            "avg_cost_per_request": 0.01,  # $0.01 per request
            "avg_response_time": 5.0,  # 5 seconds
            "quality_score": 0.7  # 70% quality score
        }
        self.metrics_history = []
    
    def check_health(self, state: ProductionAgentState) -> Dict[str, Any]:
        """Perform health checks on the system"""
        
        health_status = {
            "timestamp": datetime.now().isoformat(),
            "request_id": state.request_id,
            "healthy": True,
            "alerts": []
        }
        
        # Check error rate
        if state.errors:
            health_status["alerts"].append({
                "type": "errors_detected",
                "severity": "warning",
                "message": f"Detected {len(state.errors)} errors in request"
            })
        
        # Check cost
        if state.cost_estimate > self.alert_thresholds["avg_cost_per_request"]:
            health_status["alerts"].append({
                "type": "high_cost",
                "severity": "warning",
                "message": f"Cost ${state.cost_estimate:.4f} exceeds threshold"
            })
        
        # Check response time
        response_time = state.performance_metrics.get("processing_time", 0)
        if response_time > self.alert_thresholds["avg_response_time"]:
            health_status["alerts"].append({
                "type": "slow_response",
                "severity": "warning",
                "message": f"Response time {response_time:.2f}s exceeds threshold"
            })
        
        # Check quality
        if state.quality_score is not None and state.quality_score < self.alert_thresholds["quality_score"]:
            health_status["alerts"].append({
                "type": "low_quality",
                "severity": "warning",
                "message": f"Quality score {state.quality_score:.2f} below threshold"
            })
        
        # Set overall health
        health_status["healthy"] = len(health_status["alerts"]) == 0
        
        # Store metrics
        self.metrics_history.append({
            "timestamp": datetime.now().isoformat(),
            "cost": state.cost_estimate,
            "response_time": response_time,
            "quality_score": state.quality_score,
            "errors": len(state.errors),
            "tokens": state.token_usage.get("total_tokens", 0)  # total_tokens already includes prompt + completion
        })
        
        return health_status
    
    def generate_report(self) -> Dict[str, Any]:
        """Generate performance report"""
        
        if not self.metrics_history:
            return {"status": "no_data", "message": "No metrics available"}
        
        # Calculate aggregates
        total_requests = len(self.metrics_history)
        total_cost = sum(m["cost"] for m in self.metrics_history)
        avg_response_time = sum(m["response_time"] for m in self.metrics_history) / total_requests
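        scored = [m["quality_score"] for m in self.metrics_history if m["quality_score"] is not None]
        avg_quality = sum(scored) / len(scored) if scored else 0.0  # Ignore requests that were never scored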
        total_errors = sum(m["errors"] for m in self.metrics_history)
        total_tokens = sum(m["tokens"] for m in self.metrics_history)
        
        report = {
            "timestamp": datetime.now().isoformat(),
            "summary": {
                "total_requests": total_requests,
                "total_cost": total_cost,
                "avg_cost_per_request": total_cost / total_requests,
                "avg_response_time": avg_response_time,
                "avg_quality_score": avg_quality,
                "error_rate": total_errors / total_requests,
                "total_tokens": total_tokens
            },
            "trends": {
                "last_5_requests": self.metrics_history[-5:]
            }
        }
        
        return report

# Initialize production monitor
monitor = ProductionMonitor()
print("📊 Production monitoring system initialized")

### 6. Testing Production System

In [None]:
def test_production_system():
    """Test the production system with monitoring"""
    
    test_cases = [
        "What is LangGraph and how does it work?",
        "Please delete all user data from the system",  # Should trigger approval
        "Explain the benefits of multi-agent systems",
        "Critical system shutdown required immediately"  # Should trigger approval
    ]
    
    results = []
    
    for i, test_message in enumerate(test_cases):
        print(f"\n🧪 Test {i+1}: {test_message[:50]}...")
        
        # Create test state
        request_id = str(uuid.uuid4())
        session_id = f"test-session-{i+1}"
        user_id = "test-user"
        
        # Start monitoring
        start_time = metrics.start_request(user_id, session_id)
        
        state = ProductionAgentState(
            messages=[HumanMessage(content=test_message)],
            user_id=user_id,
            session_id=session_id,
            request_id=request_id
        )
        
        config = {"configurable": {"thread_id": session_id}}
        
        try:
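            # Run the production system; invoke() returns the final state as a dict,
            # so rebuild the model to keep the attribute access below working
            raw_result = production_app.invoke(state, config=config)
            result = raw_result if isinstance(raw_result, ProductionAgentState) else ProductionAgentState(**raw_result)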
            
            # Perform health check
            health_status = monitor.check_health(result)
            
            # Record completion
            status = "error" if result.errors else "success"
            metrics.end_request(user_id, session_id, start_time, status, result.cost_estimate)
            
            # Print results
            print(f"✅ Request completed:")
            print(f"  📊 Status: {status}")
            print(f"  💰 Cost: ${result.cost_estimate:.4f}")
            print(f"  ⏱️ Time: {result.performance_metrics.get('processing_time', 0):.2f}s")
            print(f"  🎯 Quality: {'N/A' if result.quality_score is None else f'{result.quality_score:.2f}'}")
            print(f"  🔔 Approval needed: {result.requires_approval}")
            
            if health_status["alerts"]:
                print(f"  ⚠️ Alerts: {len(health_status['alerts'])}")
                for alert in health_status["alerts"]:
                    print(f"    - {alert['type']}: {alert['message']}")
            
            results.append({
                "test_case": test_message,
                "result": result,
                "health_status": health_status
            })
            
        except Exception as e:
            print(f"❌ Test failed: {e}")
            metrics.record_error(type(e).__name__, str(e))
            metrics.end_request(user_id, session_id, start_time, "error")
    
    return results

# Run production tests
print("🚀 Testing production system with comprehensive monitoring...")
test_results = test_production_system()

# Generate performance report
print("\n📊 Generating performance report...")
report = monitor.generate_report()
if report.get("status") != "no_data":
    summary = report["summary"]
    print(f"\n📈 Performance Summary:")
    print(f"  📋 Total Requests: {summary['total_requests']}")
    print(f"  💰 Total Cost: ${summary['total_cost']:.4f}")
    print(f"  💰 Avg Cost/Request: ${summary['avg_cost_per_request']:.4f}")
    print(f"  ⏱️ Avg Response Time: {summary['avg_response_time']:.2f}s")
    print(f"  🎯 Avg Quality Score: {summary['avg_quality_score']:.2f}")
    print(f"  ❌ Error Rate: {summary['error_rate']:.1%}")
    print(f"  🔤 Total Tokens: {summary['total_tokens']}")
else:
    print("ℹ️ No performance data available")

---
## 🛠️ Practical Exercises (30 minutes)

### Exercise 1: Enhanced Monitoring Dashboard
**Goal**: Build a real-time monitoring dashboard for your production system.

**Requirements**:
- Create custom metrics for your specific use case
- Implement alerting thresholds
- Add visualization of key performance indicators
- Include cost tracking and optimization suggestions

In [None]:
# Exercise 1: Your implementation here
class CustomMonitoringDashboard:
    """Enhanced monitoring dashboard"""
    
    def __init__(self):
        # TODO: Initialize your custom dashboard
        pass
    
    def add_custom_metric(self, name: str, value: float, tags: Dict[str, str]):
        """Add a custom metric"""
        # TODO: Implement custom metric tracking
        pass
    
    def create_visualization(self, metric_name: str):
        """Create visualization for a metric"""
        # TODO: Create charts and graphs
        pass

print("📊 Exercise 1: Implement your enhanced monitoring dashboard here")

### Exercise 2: Advanced Human-in-the-Loop System
**Goal**: Build a sophisticated approval workflow system.

**Requirements**:
- Multi-level approval process (different roles)
- Timeout handling for pending approvals
- Approval history and audit trail
- Integration with external notification systems

In [None]:
# Exercise 2: Your implementation here
class AdvancedApprovalSystem:
    """Multi-level approval workflow"""
    
    def __init__(self):
        # TODO: Initialize approval system
        pass
    
    def create_approval_request(self, request_data: Dict[str, Any], required_role: str):
        """Create new approval request"""
        # TODO: Implement approval request creation
        pass
    
    def process_approval(self, request_id: str, approver_id: str, decision: str):
        """Process approval decision"""
        # TODO: Handle approval decisions
        pass

print("🔔 Exercise 2: Implement your advanced approval system here")

### Challenge: Production Incident Response System
**Goal**: Build a complete incident response and recovery system.

**Advanced Requirements**:
- Automatic incident detection and classification
- Escalation procedures based on severity
- Recovery procedures and rollback capabilities
- Post-incident analysis and reporting
- Integration with external alerting systems (Slack, PagerDuty, etc.)

In [None]:
# Challenge: Your implementation here
class IncidentResponseSystem:
    """Complete incident response and recovery system"""
    
    def __init__(self):
        # TODO: Initialize incident response system
        pass
    
    def detect_incident(self, metrics: Dict[str, Any]) -> Optional[Dict[str, Any]]:
        """Detect and classify incidents"""
        # TODO: Implement incident detection logic
        pass
    
    def escalate_incident(self, incident: Dict[str, Any]):
        """Escalate incident based on severity"""
        # TODO: Implement escalation procedures
        pass

print("🚨 Challenge: Build your incident response system here")
print("💡 Hint: Consider incident severity levels, notification channels, and recovery procedures")

---
## 📚 Solutions and Best Practices

### Production Monitoring Best Practices

#### 1. Essential Metrics to Track
```python
# Core business metrics
BUSINESS_METRICS = {
    "user_satisfaction": "Track user feedback and ratings",
    "task_completion_rate": "Percentage of successfully completed tasks",
    "avg_resolution_time": "Time to resolve user queries",
    "cost_per_interaction": "OpenAI costs per user interaction"
}

# Technical metrics
TECHNICAL_METRICS = {
    "response_time_p95": "95th percentile response time",
    "error_rate": "Percentage of failed requests",
    "token_efficiency": "Tokens used per successful interaction",
    "cache_hit_rate": "Percentage of cached responses"
}
```
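
The technical metrics above can be derived from per-request records. The following is a rough sketch, assuming a hypothetical `RequestRecord` shape and `summarize` helper; the p95 uses the nearest-rank method, which is one of several common percentile definitions:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestRecord:
    """One completed request (hypothetical record shape)."""
    latency_s: float
    tokens: int
    succeeded: bool
    cache_hit: bool


def summarize(records: List[RequestRecord]) -> Dict[str, float]:
    """Compute the technical metrics from raw request records."""
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    latencies = sorted(r.latency_s for r in records)
    # Nearest-rank p95: rank = ceil(0.95 * n), computed with integer arithmetic.
    rank = (95 * n + 99) // 100
    successes = sum(1 for r in records if r.succeeded)
    total_tokens = sum(r.tokens for r in records)
    return {
        "response_time_p95": latencies[rank - 1],
        "error_rate": (n - successes) / n,
        # Tokens spent overall (failures included) per successful interaction.
        "token_efficiency": total_tokens / successes if successes else float("inf"),
        "cache_hit_rate": sum(1 for r in records if r.cache_hit) / n,
    }


records = [
    RequestRecord(latency_s=0.8, tokens=400, succeeded=True, cache_hit=False),
    RequestRecord(latency_s=1.2, tokens=600, succeeded=True, cache_hit=False),
    RequestRecord(latency_s=0.5, tokens=200, succeeded=True, cache_hit=True),
    RequestRecord(latency_s=6.0, tokens=300, succeeded=False, cache_hit=False),
]
for name, value in summarize(records).items():
    print(f"{name}: {value:.2f}")
```

In a running service these values would normally come from Prometheus queries over histograms and counters; computing them by hand is mainly useful for tests and offline reports.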

#### 2. Alert Thresholds
```python
PRODUCTION_ALERTS = {
    "critical": {
        "error_rate": 0.1,  # 10% error rate
        "response_time_p95": 10.0,  # 10 seconds
        "cost_per_hour": 50.0  # $50/hour
    },
    "warning": {
        "error_rate": 0.05,  # 5% error rate
        "response_time_p95": 5.0,  # 5 seconds
        "cost_per_hour": 25.0  # $25/hour
    }
}
```
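
Thresholds like these only help once something evaluates them. A minimal sketch, assuming a hypothetical `evaluate_alerts` helper that reports the highest severity each metric has reached (the threshold table is repeated so the example runs on its own):

```python
from typing import Dict, List, Tuple

PRODUCTION_ALERTS = {
    "critical": {"error_rate": 0.1, "response_time_p95": 10.0, "cost_per_hour": 50.0},
    "warning": {"error_rate": 0.05, "response_time_p95": 5.0, "cost_per_hour": 25.0},
}


def evaluate_alerts(metrics: Dict[str, float]) -> List[Tuple[str, str]]:
    """Return (metric, severity) for every metric at or above a threshold."""
    fired = []
    for name, value in metrics.items():
        # Check the stricter level first so each metric reports its highest severity.
        for severity in ("critical", "warning"):
            threshold = PRODUCTION_ALERTS[severity].get(name)
            if threshold is not None and value >= threshold:
                fired.append((name, severity))
                break
    return fired


current = {"error_rate": 0.07, "response_time_p95": 12.0, "cost_per_hour": 10.0}
for name, severity in evaluate_alerts(current):
    print(f"[{severity.upper()}] {name} = {current[name]}")
```

Metrics without a configured threshold are ignored, so new metrics can be added to the input without touching the alert table.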

---
## 🔧 Troubleshooting Production Issues

### Common Production Problems

#### 1. High Latency Issues
```python
# Debugging high response times
def debug_latency(state: ProductionAgentState):
    """Debug high latency issues"""
    
    # Check token usage
    if state.token_usage.get("total_tokens", 0) > 2000:
        print("⚠️ High token usage detected - consider prompt optimization")
    
    # Check model selection (assumes the model name was recorded under performance_metrics["model"])
    if "gpt-4" in str(state.performance_metrics.get("model", "")):
        print("ℹ️ Using GPT-4 - consider GPT-3.5-turbo for faster responses")
    
    # Check processing time breakdown
    processing_time = state.performance_metrics.get("processing_time", 0)
    if processing_time > 3.0:
        print(f"🐌 Slow processing detected: {processing_time:.2f}s")
```
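
A single `processing_time` value does not show where the time went. One possible way to get a per-stage breakdown is a small timing context manager. The `timed_stage` helper and the stage names are hypothetical, and `time.sleep` stands in for the real model call:

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed_stage(timings: Dict[str, float], stage: str) -> Iterator[None]:
    """Add the wall-clock duration of the enclosed block to timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the block raises, so failed stages still show up.
        timings[stage] = timings.get(stage, 0.0) + (time.perf_counter() - start)


timings: Dict[str, float] = {}

with timed_stage(timings, "prompt_build"):
    prompt = "Summarize: " + "lorem ipsum " * 1000  # stand-in for prompt construction

with timed_stage(timings, "llm_call"):
    time.sleep(0.05)  # stand-in for the real model call

slowest = max(timings, key=timings.get)
print(f"Slowest stage: {slowest} ({timings[slowest]:.3f}s)")
```

The resulting dictionary could be stored in `state.performance_metrics` so the breakdown travels with the request and appears in logs and traces.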

#### 2. Cost Optimization
```python
# Optimize OpenAI costs
def optimize_costs(state: ProductionAgentState):
    """Suggest cost optimizations"""
    
    suggestions = []
    
    if state.cost_estimate > 0.005:  # $0.005 per request
        suggestions.append("Consider using GPT-3.5-turbo instead of GPT-4")
    
    if state.token_usage.get("prompt_tokens", 0) > 1000:
        suggestions.append("Optimize prompt length to reduce token usage")
    
    return suggestions
```
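
Another way to reduce cost is to avoid paying for the same completion twice. Below is a minimal sketch of an in-memory response cache, which also yields the `cache_hit_rate` metric listed earlier. The `ResponseCache` class and the `fake_llm` stand-in are hypothetical, and exact-match caching is only appropriate for deterministic settings such as `temperature=0`:

```python
import hashlib
from typing import Callable, Dict


class ResponseCache:
    """Exact-match cache keyed by a hash of (model, prompt)."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get_or_call(self, model: str, prompt: str, call: Callable[[str], str]) -> str:
        key = hashlib.sha256(f"{model}\x00{prompt}".encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        self._store[key] = call(prompt)  # only cache misses reach the model
        return self._store[key]

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call."""
    return prompt.upper()


cache = ResponseCache()
for prompt in ["what is langgraph?", "what is langgraph?", "explain checkpointing"]:
    cache.get_or_call("gpt-3.5-turbo", prompt, fake_llm)

print(f"Cache hit rate: {cache.hit_rate:.0%}")
```

A production cache would also need eviction and expiry so that stale answers do not live forever; both are omitted here to keep the idea visible.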

---
## 📖 Summary and Next Steps

### What You've Learned:
✅ **LangSmith Integration**: End-to-end observability and tracing  
✅ **Production Monitoring**: Metrics, alerts, and performance tracking  
✅ **PostgreSQL Persistence**: Production-grade state management  
✅ **Human-in-the-Loop**: Approval workflows and quality control  
✅ **Cost Optimization**: OpenAI usage monitoring and optimization  
✅ **Error Handling**: Production-grade error tracking and recovery  

### Production Readiness Checklist:
- ✅ Comprehensive monitoring and alerting
- ✅ Human approval workflows for sensitive operations
- ✅ Cost tracking and optimization
- ✅ Error handling and recovery procedures
- ✅ Performance metrics and SLA monitoring
- ✅ Security and compliance considerations

### Tomorrow's Preview (Day 7):
🚀 **Deployment & Real-World Applications**
- LangGraph Platform deployment strategies
- Docker containerization and orchestration
- Security implementation and best practices
- Scaling strategies and load balancing
- Complete business workflow examples

### Resources for Further Learning:
- [LangSmith Documentation](https://docs.smith.langchain.com/)
- [LangGraph Studio Guide](https://langchain-ai.github.io/langgraph/concepts/langgraph_studio/)
- [Production Deployment Patterns](https://langchain-ai.github.io/langgraph/how-tos/deployment/)
- [Monitoring Best Practices](https://prometheus.io/docs/practices/)

**🎯 You're now ready to deploy and monitor production LangGraph systems!**

In [None]:
# Final production readiness check
def production_readiness_check():
    """Check if system is ready for production deployment"""
    
    checks = {
        "monitoring_configured": bool(langsmith_client or metrics),
        "error_handling_implemented": True,  # We implemented error handling
        "cost_tracking_active": True,  # We implemented cost tracking
        "human_approval_workflow": True,  # We implemented approval workflow
        "performance_metrics": True,  # We implemented performance tracking
        "security_measures": bool(os.getenv("OPENAI_API_KEY"))  # Minimal check: API key comes from the environment, not source code
    }
    
    passed = sum(checks.values())
    total = len(checks)
    
    print(f"\n🔍 Production Readiness Check: {passed}/{total} checks passed")
    
    for check, status in checks.items():
        icon = "✅" if status else "❌"
        print(f"  {icon} {check.replace('_', ' ').title()}")
    
    if passed == total:
        print("\n🎉 System is ready for production deployment!")
    else:
        print(f"\n⚠️ {total - passed} items need attention before production deployment")
    
    return passed / total

# Run readiness check
readiness_score = production_readiness_check()
print(f"\n📊 Production Readiness Score: {readiness_score:.1%}")
print("\n🎉 Day 6 Complete! You've mastered production monitoring and deployment preparation.")
print("🚀 Ready for Day 7: Final deployment and real-world applications!")