# 10_Advanced Application Cases

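## Learning Objectives
- Use LangGraph in complex application scenarios
- Build a RAG (Retrieval-Augmented Generation) system
- Understand the design of multi-agent collaboration systems
- Apply production deployment best practices
- Learn performance optimization and error handling strategies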

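## Core Concepts

### RAG Systems
Retrieval-Augmented Generation (RAG) combines information retrieval with text generation: relevant documents are retrieved first, then supplied to the model as context for the answer.

### Multi-Agent Systems
A distributed system in which multiple intelligent agents cooperate to complete complex tasks.

### Workflow Orchestration
Automated orchestration and execution of complex business processes.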

## Environment Setup

In [None]:
import asyncio
import json
import time
import uuid
from datetime import datetime, timedelta
from typing import Dict, Any, List, Optional, TypedDict, Callable
from dataclasses import dataclass, asdict
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage
from langchain_core.documents import Document
from langchain_core.tools import Tool
from langgraph.graph import StateGraph, START, END
from langgraph.graph.state import CompiledStateGraph
from langgraph.checkpoint.memory import MemorySaver
from langgraph.prebuilt import ToolNode
import logging
import numpy as np
import pandas as pd

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

print("Environment setup complete")

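## 1. Advanced RAG System

### 1.1 Intelligent Document Retrieval and Question Answering

This section shows how to build a complete RAG system, covering document retrieval, context management, and answer generation.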

In [None]:
# Mock vector database
class MockVectorDB:
    """A mock vector database."""
    
    def __init__(self):
        self.documents = []
        self.embeddings = {}
    
    def add_documents(self, docs: List[Document]):
        """Add documents."""
        for doc in docs:
            doc_id = str(uuid.uuid4())
            self.documents.append({"id": doc_id, "content": doc.page_content, "metadata": doc.metadata})
            # Simulated embedding: a random 768-dimensional vector (not used by the search below)
            self.embeddings[doc_id] = np.random.rand(768)
    
    def similarity_search(self, query: str, k: int = 3) -> List[Dict[str, Any]]:
        """Similarity search."""
        # Simulated search: word-overlap (Jaccard) similarity on whitespace-separated tokens.
        # Text without spaces (e.g. Chinese) becomes a single token, so scores are coarse.
        query_words = set(query.lower().split())
        results = []
        
        for doc in self.documents:
            content_words = set(doc["content"].lower().split())
            union = query_words.union(content_words)
            # Guard against division by zero when both the query and the document are empty
            similarity = len(query_words.intersection(content_words)) / len(union) if union else 0.0
            results.append({"document": doc, "similarity": similarity})
        
        # Sort by similarity and return the top k
        results.sort(key=lambda x: x["similarity"], reverse=True)
        return [r["document"] for r in results[:k]]

# Create the knowledge base
knowledge_base = MockVectorDB()

# Add sample documents
sample_docs = [
    Document(
        page_content="LangGraph是一个用于构建有状态、多参与者应用程序的库，基于LangChain构建。它扩展了LangChain Expression Language，增加了循环、条件分支和持久性等功能。",
        metadata={"source": "langgraph_intro.txt", "type": "documentation"}
    ),
    Document(
        page_content="StateGraph是LangGraph的核心组件，它允许您定义具有状态的计算图。每个节点代表一个计算步骤，边定义了执行流程。",
        metadata={"source": "stategraph_guide.txt", "type": "tutorial"}
    ),
    Document(
        page_content="检查点机制允许LangGraph应用程序保存和恢复状态。这对于构建能够从故障中恢复的弹性应用程序至关重要。",
        metadata={"source": "checkpoint_guide.txt", "type": "tutorial"}
    ),
    Document(
        page_content="工具调用是现代AI应用的核心功能。LangGraph提供了ToolNode来简化工具集成，支持函数调用和外部API集成。",
        metadata={"source": "tools_guide.txt", "type": "tutorial"}
    ),
    Document(
        page_content="多Agent系统允许多个AI代理协作完成复杂任务。每个Agent都有专门的职责和能力，通过协调机制实现目标。",
        metadata={"source": "multi_agent.txt", "type": "advanced"}
    )
]

knowledge_base.add_documents(sample_docs)
print(f"知识库已加载 {len(sample_docs)} 个文档")
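
The word-overlap search in `MockVectorDB.similarity_search` splits on whitespace, which gives very coarse scores for text written without spaces, such as the Chinese sample documents. As a minimal sketch (not part of the original notebook), character bigrams can stand in for words. The helper names `char_ngrams`, `jaccard`, and `rank_documents` are hypothetical, and a real system would more likely compare embedding vectors.

```python
from typing import List, Set, Tuple


def char_ngrams(text: str, n: int = 2) -> Set[str]:
    """Return the set of character n-grams of text, ignoring case and whitespace."""
    cleaned = "".join(ch for ch in text.lower() if not ch.isspace())
    if len(cleaned) < n:
        # Too short for a full n-gram: use the whole string (or nothing if empty)
        return {cleaned} if cleaned else set()
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|, defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_documents(query: str, docs: List[str], k: int = 3) -> List[Tuple[float, str]]:
    """Score every document against the query and return the top k (score, doc) pairs."""
    query_grams = char_ngrams(query)
    scored = [(jaccard(query_grams, char_ngrams(doc)), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


docs = [
    "检查点机制允许保存和恢复状态",
    "工具调用是核心功能",
]
top = rank_documents("检查点机制有什么用", docs, k=1)
print(top[0][1])
```

Bigrams are a crude substitute for proper word segmentation, but they are enough to make the mock search distinguish between the sample documents.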

### 1.2 RAG Processing Flow

Define the core processing flow of the RAG system: document retrieval, answer generation, and quality validation.

In [None]:
# RAG系统状态
class RAGState(TypedDict):
    query: str
    retrieved_docs: List[Dict[str, Any]]
    context: str
    answer: str
    confidence: float
    sources: List[str]
    processing_steps: List[str]

class MockLLM:
    """模拟大语言模型"""
    
    def generate_answer(self, context: str, query: str) -> Dict[str, Any]:
        """基于上下文生成答案"""
        # 模拟LLM生成答案
        if "langgraph" in query.lower():
            answer = f"根据检索到的文档，{query}的答案是：LangGraph是一个功能强大的库，用于构建有状态的多参与者应用程序。它基于LangChain构建，提供了循环、条件分支和持久性等高级功能。"
            confidence = 0.9
        elif "stategraph" in query.lower() or "状态图" in query.lower():
            answer = f"关于{query}：StateGraph是LangGraph的核心组件，用于定义具有状态的计算图。每个节点代表计算步骤，边定义执行流程。"
            confidence = 0.85
        elif "检查点" in query.lower() or "checkpoint" in query.lower():
            answer = f"针对{query}：检查点机制是LangGraph的重要功能，允许应用程序保存和恢复状态，构建弹性应用程序。"
            confidence = 0.8
        else:
            answer = f"根据可用信息，我对{query}的理解是：这是一个关于LangGraph相关技术的问题，建议查阅更多具体文档。"
            confidence = 0.6
        
        return {"answer": answer, "confidence": confidence}

# 创建模拟LLM实例
llm = MockLLM()

def document_retrieval_node(state: RAGState) -> RAGState:
    """文档检索节点"""
    query = state["query"]
    state["processing_steps"].append(f"正在检索与查询 '{query}' 相关的文档")
    
    # 从知识库检索相关文档
    retrieved_docs = knowledge_base.similarity_search(query, k=3)
    state["retrieved_docs"] = retrieved_docs
    
    # 构建上下文
    context_parts = []
    sources = []
    
    for i, doc in enumerate(retrieved_docs, 1):
        context_parts.append(f"文档{i}: {doc['content']}")
        sources.append(doc['metadata']['source'])
    
    state["context"] = "\n\n".join(context_parts)
    state["sources"] = sources
    state["processing_steps"].append(f"检索到 {len(retrieved_docs)} 个相关文档")
    
    return state

def answer_generation_node(state: RAGState) -> RAGState:
    """答案生成节点"""
    state["processing_steps"].append("正在生成答案")
    
    context = state["context"]
    query = state["query"]
    
    # 使用LLM生成答案
    result = llm.generate_answer(context, query)
    
    state["answer"] = result["answer"]
    state["confidence"] = result["confidence"]
    state["processing_steps"].append(f"答案生成完成，置信度: {result['confidence']:.2f}")
    
    return state

def answer_validation_node(state: RAGState) -> RAGState:
    """答案验证节点"""
    state["processing_steps"].append("正在验证答案质量")
    
    confidence = state["confidence"]
    answer = state["answer"]
    
    # Simple answer-quality checks
    if confidence < 0.7:
        state["answer"] = f"注意：答案置信度较低({confidence:.2f})。{answer}\n\n建议：请提供更具体的问题以获得更准确的答案。"
        state["processing_steps"].append("检测到低置信度答案，已添加警告信息")
    
    if len(answer) < 50:
        # Append to state["answer"] rather than `answer`, so the low-confidence warning above is preserved
        state["answer"] = f"{state['answer']}\n\n补充说明：如需更详细信息，请查阅相关文档或提出更具体的问题。"
        state["processing_steps"].append("答案较短，已添加补充说明")
    
    state["processing_steps"].append("答案验证完成")
    
    return state

# 创建RAG工作流
rag_workflow = StateGraph(RAGState)
rag_workflow.add_node("retrieve", document_retrieval_node)
rag_workflow.add_node("generate", answer_generation_node)
rag_workflow.add_node("validate", answer_validation_node)

rag_workflow.add_edge(START, "retrieve")
rag_workflow.add_edge("retrieve", "generate")
rag_workflow.add_edge("generate", "validate")
rag_workflow.add_edge("validate", END)

rag_app = rag_workflow.compile()

print("RAG系统创建完成")
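
The workflow above is strictly linear: retrieve, generate, validate. One possible extension is to branch on the confidence score and retry retrieval when the answer looks weak. The sketch below shows only the routing decision as a plain function. `RoutingState`, `route_after_generation`, the `attempts` counter, and the thresholds are hypothetical names and values, not part of the original notebook.

```python
from typing import TypedDict


class RoutingState(TypedDict):
    confidence: float
    attempts: int  # assumption: some node increments this on every retrieval pass


def route_after_generation(
    state: RoutingState, threshold: float = 0.7, max_attempts: int = 2
) -> str:
    """Return the name of the next node based on answer confidence."""
    if state["confidence"] >= threshold:
        return "validate"   # confident enough: go straight to validation
    if state["attempts"] < max_attempts:
        return "retrieve"   # low confidence: try retrieval again
    return "validate"       # retry budget used up: validate what we have


# Possible wiring in LangGraph (replacing the fixed "generate" -> "validate" edge):
# rag_workflow.add_conditional_edges(
#     "generate", route_after_generation,
#     {"retrieve": "retrieve", "validate": "validate"},
# )

print(route_after_generation({"confidence": 0.5, "attempts": 0}))
```

Capping the number of retries matters here: without `max_attempts`, a query that never reaches the threshold would loop forever.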

### 1.3 Testing the RAG System

In [None]:
def test_rag_system(queries: List[str]):
    """测试RAG系统"""
    for query in queries:
        print(f"\n{'='*60}")
        print(f"查询: {query}")
        print(f"{'='*60}")
        
        # 创建初始状态
        initial_state = {
            "query": query,
            "retrieved_docs": [],
            "context": "",
            "answer": "",
            "confidence": 0.0,
            "sources": [],
            "processing_steps": []
        }
        
        # 执行RAG流程
        result = rag_app.invoke(initial_state)
        
        # 显示结果
        print("\n🔍 处理步骤:")
        for step in result["processing_steps"]:
            print(f"  • {step}")
        
        print(f"\n📚 检索到的文档数量: {len(result['retrieved_docs'])}")
        print(f"📄 参考来源: {', '.join(result['sources'])}")
        print(f"🎯 置信度: {result['confidence']:.2f}")
        
        print(f"\n💬 答案:")
        print(result["answer"])
        
        print(f"\n📋 检索的文档内容:")
        for i, doc in enumerate(result["retrieved_docs"], 1):
            print(f"  {i}. [{doc['metadata']['source']}] {doc['content'][:100]}...")

# 测试查询
test_queries = [
    "什么是LangGraph？",
    "StateGraph如何工作？",
    "检查点机制有什么用？",
    "如何实现多Agent协作？"
]

test_rag_system(test_queries)

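## 2. Multi-Agent Collaboration System

### 2.1 Intelligent Customer Service System

This section shows how to build a multi-agent customer service system in which different agents handle different types of customer requests.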

In [None]:
@dataclass
class CustomerRequest:
    """客户请求"""
    request_id: str
    customer_id: str
    category: str
    priority: str
    content: str
    timestamp: str
    status: str = "pending"
    assigned_agent: Optional[str] = None
    resolution: Optional[str] = None

@dataclass
class Agent:
    """Agent基类"""
    agent_id: str
    name: str
    specialization: List[str]
    availability: bool = True
    current_load: int = 0
    max_load: int = 3

# 定义不同类型的Agent
class TechnicalSupportAgent(Agent):
    """技术支持Agent"""
    
    def handle_request(self, request: CustomerRequest) -> Dict[str, Any]:
        """处理技术支持请求"""
        if "登录" in request.content or "密码" in request.content:
            resolution = "请尝试以下步骤：1) 清除浏览器缓存 2) 重置密码 3) 检查网络连接。如问题持续存在，请联系系统管理员。"
            confidence = 0.9
        elif "错误" in request.content or "bug" in request.content.lower():
            resolution = "我们已记录此错误报告。技术团队将在24小时内调查并提供解决方案。感谢您的反馈。"
            confidence = 0.8
        elif "性能" in request.content or "慢" in request.content:
            resolution = "性能问题可能由多种因素引起。建议：1) 关闭不必要的应用程序 2) 检查网络速度 3) 清理系统垃圾文件。"
            confidence = 0.85
        else:
            resolution = "您的技术问题已收到。我们的专家团队将详细分析并在2个工作日内回复您。"
            confidence = 0.7
        
        return {
            "resolution": resolution,
            "confidence": confidence,
            "estimated_time": "2-24小时",
            "follow_up_needed": confidence < 0.8
        }

class BillingAgent(Agent):
    """账单Agent"""
    
    def handle_request(self, request: CustomerRequest) -> Dict[str, Any]:
        """处理账单相关请求"""
        if "发票" in request.content or "账单" in request.content:
            resolution = "您的账单信息已发送至注册邮箱。如未收到，请检查垃圾邮件文件夹或联系我们重新发送。"
            confidence = 0.95
        elif "付款" in request.content or "支付" in request.content:
            resolution = "我们支持多种支付方式：信用卡、银行转账、支付宝、微信支付。付款后通常1-3个工作日到账。"
            confidence = 0.9
        elif "退款" in request.content:
            resolution = "退款申请已提交至财务部门。根据支付方式，退款将在5-10个工作日内处理完成。"
            confidence = 0.85
        else:
            resolution = "您的账单相关问题已收到。我们的财务团队将在1个工作日内为您核实并回复。"
            confidence = 0.8
        
        return {
            "resolution": resolution,
            "confidence": confidence,
            "estimated_time": "1-5个工作日",
            "follow_up_needed": "退款" in request.content
        }

class GeneralInquiryAgent(Agent):
    """一般咨询Agent"""
    
    def handle_request(self, request: CustomerRequest) -> Dict[str, Any]:
        """处理一般咨询"""
        if "产品" in request.content or "功能" in request.content:
            resolution = "感谢您对我们产品的关注。我们的产品功能丰富，包括核心功能A、B、C。如需详细了解，建议预约产品演示。"
            confidence = 0.8
        elif "价格" in request.content or "费用" in request.content:
            resolution = "我们提供多种套餐选择：基础版、专业版、企业版。具体价格因功能和使用量而异，建议联系销售顾问获取定制报价。"
            confidence = 0.85
        elif "合作" in request.content or "合同" in request.content:
            resolution = "感谢您的合作意向。我们的商务团队将联系您讨论具体合作方案和条款。"
            confidence = 0.9
        else:
            resolution = "谢谢您的咨询。我们已收到您的问题，客服代表将在4小时内与您联系。"
            confidence = 0.75
        
        return {
            "resolution": resolution,
            "confidence": confidence,
            "estimated_time": "4小时内",
            "follow_up_needed": confidence < 0.8
        }

print("多Agent系统类定义完成")

### 2.2 Multi-Agent Collaboration Workflow

In [None]:
# 创建Agent实例
agents = {
    "tech_support": TechnicalSupportAgent(
        agent_id="tech_001",
        name="技术支持专家",
        specialization=["technical", "bug", "performance"]
    ),
    "billing": BillingAgent(
        agent_id="billing_001",
        name="账单专员",
        specialization=["billing", "payment", "invoice"]
    ),
    "general": GeneralInquiryAgent(
        agent_id="general_001",
        name="客服代表",
        specialization=["general", "inquiry", "product"]
    )
}

# 多Agent系统状态
class MultiAgentState(TypedDict):
    request: CustomerRequest
    assigned_agent_type: Optional[str]
    agent_response: Optional[Dict[str, Any]]
    escalation_needed: bool
    processing_log: List[str]
    final_resolution: Optional[str]

def request_classifier_node(state: MultiAgentState) -> MultiAgentState:
    """请求分类节点"""
    request = state["request"]
    content = request.content.lower()
    
    state["processing_log"].append(f"开始分类请求: {request.request_id}")
    
    # 基于关键词分类
    if any(keyword in content for keyword in ["错误", "bug", "登录", "密码", "性能", "慢"]):
        agent_type = "tech_support"
        request.category = "technical"
    elif any(keyword in content for keyword in ["账单", "发票", "付款", "支付", "退款"]):
        agent_type = "billing"
        request.category = "billing"
    else:
        agent_type = "general"
        request.category = "general"
    
    # 设置优先级
    if any(keyword in content for keyword in ["紧急", "严重", "无法使用"]):
        request.priority = "high"
    elif any(keyword in content for keyword in ["重要", "尽快"]):
        request.priority = "medium"
    else:
        request.priority = "low"
    
    state["assigned_agent_type"] = agent_type
    state["processing_log"].append(f"请求分类为: {request.category}, 优先级: {request.priority}, 分配给: {agent_type}")
    
    return state

def agent_processing_node(state: MultiAgentState) -> MultiAgentState:
    """Agent处理节点"""
    request = state["request"]
    agent_type = state["assigned_agent_type"]
    
    state["processing_log"].append(f"Agent {agent_type} 开始处理请求")
    
    # 获取对应的Agent
    agent = agents[agent_type]
    
    # 检查Agent可用性
    if agent.current_load >= agent.max_load:
        state["processing_log"].append(f"Agent {agent_type} 负载已满，请求加入队列")
        # 在实际系统中，这里会有队列管理逻辑
        agent_response = {
            "resolution": "当前咨询量较大，您的请求已加入队列，我们会尽快为您处理。",
            "confidence": 0.6,
            "estimated_time": "1-2小时",
            "follow_up_needed": True
        }
    else:
        # The agent handles the request; try/finally releases the load slot even if handling fails
        agent.current_load += 1
        try:
            agent_response = agent.handle_request(request)
        finally:
            agent.current_load -= 1
        
        state["processing_log"].append(f"Agent处理完成，置信度: {agent_response['confidence']}")
    
    state["agent_response"] = agent_response
    request.assigned_agent = agent.agent_id
    request.status = "processed"
    
    return state

def quality_assurance_node(state: MultiAgentState) -> MultiAgentState:
    """质量保证节点"""
    agent_response = state["agent_response"]
    request = state["request"]
    
    state["processing_log"].append("开始质量检查")
    
    # 质量检查逻辑
    confidence = agent_response["confidence"]
    follow_up_needed = agent_response["follow_up_needed"]
    
    if confidence < 0.7 or follow_up_needed or request.priority == "high":
        state["escalation_needed"] = True
        state["processing_log"].append("需要升级处理")
        
        # 添加升级处理信息
        escalation_note = "\n\n[系统提醒] 此问题已升级至专业团队，我们会安排资深专家跟进处理。"
        final_resolution = agent_response["resolution"] + escalation_note
    else:
        state["escalation_needed"] = False
        final_resolution = agent_response["resolution"]
        state["processing_log"].append("质量检查通过")
    
    state["final_resolution"] = final_resolution
    request.resolution = final_resolution
    request.status = "resolved" if not state["escalation_needed"] else "escalated"
    
    return state

# 创建多Agent工作流
multi_agent_workflow = StateGraph(MultiAgentState)
multi_agent_workflow.add_node("classify", request_classifier_node)
multi_agent_workflow.add_node("process", agent_processing_node)
multi_agent_workflow.add_node("qa", quality_assurance_node)

multi_agent_workflow.add_edge(START, "classify")
multi_agent_workflow.add_edge("classify", "process")
multi_agent_workflow.add_edge("process", "qa")
multi_agent_workflow.add_edge("qa", END)

multi_agent_app = multi_agent_workflow.compile()

print("多Agent客服系统创建完成")
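
The `Agent` dataclass defines `specialization`, `availability`, `current_load`, and `max_load`, but the workflow above looks agents up by a fixed key and only checks the load. As a minimal sketch of how those fields could drive assignment when several agents share a specialization, the code below picks the least-loaded eligible agent. `AgentSlot` and `pick_agent` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AgentSlot:
    """Stand-in for the notebook's Agent dataclass (same load-related fields)."""
    agent_id: str
    specialization: List[str]
    availability: bool = True
    current_load: int = 0
    max_load: int = 3


def pick_agent(pool: List[AgentSlot], category: str) -> Optional[AgentSlot]:
    """Return the least-loaded available agent for the category, or None if all are busy."""
    candidates = [
        agent for agent in pool
        if agent.availability
        and category in agent.specialization
        and agent.current_load < agent.max_load
    ]
    return min(candidates, key=lambda agent: agent.current_load, default=None)


pool = [
    AgentSlot("tech_001", ["technical"], current_load=2),
    AgentSlot("tech_002", ["technical"], current_load=0),
    AgentSlot("billing_001", ["billing"], current_load=3),
]
chosen = pick_agent(pool, "technical")
print(chosen.agent_id if chosen else "queued")
```

Returning `None` instead of raising leaves the caller free to decide what "no capacity" means, for example queueing the request as the processing node does.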

### 2.3 Testing the Multi-Agent System

In [None]:
def test_multi_agent_system():
    """测试多Agent系统"""
    
    # 创建测试请求
    test_requests = [
        CustomerRequest(
            request_id="REQ001",
            customer_id="CUST001",
            category="",
            priority="",
            content="我无法登录系统，总是显示密码错误，这很紧急！",
            timestamp=datetime.now().isoformat()
        ),
        CustomerRequest(
            request_id="REQ002",
            customer_id="CUST002",
            category="",
            priority="",
            content="我需要这个月的发票，请发送到我的邮箱",
            timestamp=datetime.now().isoformat()
        ),
        CustomerRequest(
            request_id="REQ003",
            customer_id="CUST003",
            category="",
            priority="",
            content="请问你们的产品有哪些功能？价格如何？",
            timestamp=datetime.now().isoformat()
        )
    ]
    
    for request in test_requests:
        print(f"\n{'='*70}")
        print(f"处理客户请求: {request.request_id}")
        print(f"客户ID: {request.customer_id}")
        print(f"请求内容: {request.content}")
        print(f"{'='*70}")
        
        # 创建初始状态
        initial_state = {
            "request": request,
            "assigned_agent_type": None,
            "agent_response": None,
            "escalation_needed": False,
            "processing_log": [],
            "final_resolution": None
        }
        
        # 执行多Agent处理流程
        result = multi_agent_app.invoke(initial_state)
        
        # 显示处理结果
        print("\n📋 处理流程:")
        for log in result["processing_log"]:
            print(f"  • {log}")
        
        processed_request = result["request"]
        agent_response = result["agent_response"]
        
        print(f"\n🏷️  分类结果: {processed_request.category}")
        print(f"⚡ 优先级: {processed_request.priority}")
        print(f"👤 分配Agent: {result['assigned_agent_type']}")
        print(f"📊 置信度: {agent_response['confidence']:.2f}")
        print(f"⏱️  预计处理时间: {agent_response['estimated_time']}")
        print(f"🔄 是否需要升级: {'是' if result['escalation_needed'] else '否'}")
        print(f"✅ 最终状态: {processed_request.status}")
        
        print(f"\n💬 客服回复:")
        print(result["final_resolution"])

# 运行测试
test_multi_agent_system()

## 3. Production Best Practices

### 3.1 Error Handling and Retry Mechanisms

In [None]:
import functools
import random
from enum import Enum

class RetryPolicy(Enum):
    """重试策略"""
    NO_RETRY = "no_retry"
    FIXED_DELAY = "fixed_delay"
    EXPONENTIAL_BACKOFF = "exponential_backoff"

@dataclass
class RetryConfig:
    """重试配置"""
    max_attempts: int = 3
    base_delay: float = 1.0
    max_delay: float = 60.0
    policy: RetryPolicy = RetryPolicy.EXPONENTIAL_BACKOFF

def with_retry(retry_config: RetryConfig):
    """重试装饰器"""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None
            
            for attempt in range(retry_config.max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    last_exception = e
                    
                    # On the last attempt, stop retrying; the exception is re-raised after the loop
                    if attempt == retry_config.max_attempts - 1:
                        break
                    
                    # 计算延迟时间
                    if retry_config.policy == RetryPolicy.FIXED_DELAY:
                        delay = retry_config.base_delay
                    elif retry_config.policy == RetryPolicy.EXPONENTIAL_BACKOFF:
                        delay = min(retry_config.base_delay * (2 ** attempt), retry_config.max_delay)
                    else:
                        delay = 0
                    
                    print(f"尝试 {attempt + 1} 失败: {str(e)}, {delay}秒后重试...")
                    time.sleep(delay)
            
            # 所有尝试都失败了
            raise last_exception
        
        return wrapper
    return decorator

# 自定义异常类
class NetworkError(Exception):
    """网络错误"""
    pass

class ServiceUnavailableError(Exception):
    """服务不可用错误"""
    pass

# 带重试机制的服务类
class RobustService:
    """具有重试机制的健壮服务"""
    
    def __init__(self):
        self.call_count = 0
        self.failure_rate = 0.3  # 30%的失败率用于模拟
    
    @with_retry(RetryConfig(
        max_attempts=3,
        base_delay=1.0,
        policy=RetryPolicy.EXPONENTIAL_BACKOFF
    ))
    def call_external_api(self, data: Dict[str, Any]) -> Dict[str, Any]:
        """调用外部API"""
        self.call_count += 1
        print(f"调用外部API，第 {self.call_count} 次尝试")
        
        # 模拟错误情况
        if random.random() < self.failure_rate:
            error_type = random.choice([NetworkError, ServiceUnavailableError])
            if error_type == NetworkError:
                raise NetworkError("网络连接超时")
            else:
                raise ServiceUnavailableError("外部服务暂时不可用")
        
        # 成功情况
        return {
            "status": "success",
            "data": f"处理成功，调用次数: {self.call_count}",
            "timestamp": datetime.now().isoformat()
        }

print("错误处理和重试机制设置完成")
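
The `with_retry` decorator above retries on every `Exception`. A common refinement is to retry only errors known to be transient and to let everything else fail immediately. The sketch below illustrates that idea together with the capped exponential delay schedule. `TransientError`, `backoff_delay`, and `call_with_retry` are hypothetical names, and the injectable `sleep` parameter is an assumption added so the example runs without waiting.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def backoff_delay(attempt: int, base_delay: float = 1.0,
                  max_delay: float = 60.0, jitter: float = 0.0) -> float:
    """Capped exponential delay, plus up to `jitter` (a fraction) of random extra wait."""
    delay = min(base_delay * (2 ** attempt), max_delay)
    return delay + random.uniform(0, jitter * delay)


def call_with_retry(func: Callable[[], T],
                    retryable: Tuple[Type[BaseException], ...] = (TransientError,),
                    max_attempts: int = 3,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func, retrying only exceptions listed in `retryable`."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the last error
            sleep(backoff_delay(attempt, base_delay))
    raise AssertionError("unreachable")


calls = {"count": 0}


def flaky() -> str:
    """Fails twice with a transient error, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("temporary failure")
    return "ok"


delays = []  # record the requested delays instead of actually sleeping
result = call_with_retry(flaky, sleep=delays.append)
print(result, delays)
```

A non-retryable error such as `ValueError` passes straight through `call_with_retry` without any delay, which keeps genuine bugs from being masked by retries.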

### 3.2 Performance Monitoring System

In [None]:
from collections import defaultdict, deque
from contextlib import contextmanager

@dataclass
class PerformanceMetrics:
    """性能指标"""
    node_name: str
    execution_time: float
    success: bool
    timestamp: str
    error_message: Optional[str] = None

class PerformanceMonitor:
    """性能监控器"""
    
    def __init__(self, max_history=1000):
        self.metrics_history = deque(maxlen=max_history)
        self.node_metrics = defaultdict(list)
        self.alerts = []
        self.thresholds = {
            "max_execution_time": 10.0,  # 秒
            "max_error_rate": 0.1        # 10%
        }
    
    @contextmanager
    def monitor_node(self, node_name: str):
        """节点性能监控上下文管理器"""
        start_time = time.perf_counter()  # monotonic clock, suited to measuring durations
        success = True
        error_message = None
        
        try:
            yield
        except Exception as e:
            success = False
            error_message = str(e)
            raise
        finally:
            # Compute performance metrics
            execution_time = time.perf_counter() - start_time
            
            metrics = PerformanceMetrics(
                node_name=node_name,
                execution_time=execution_time,
                success=success,
                timestamp=datetime.now().isoformat(),
                error_message=error_message
            )
            
            self.record_metrics(metrics)
    
    def record_metrics(self, metrics: PerformanceMetrics):
        """记录性能指标"""
        self.metrics_history.append(metrics)
        self.node_metrics[metrics.node_name].append(metrics)
        
        # 检查是否触发告警
        self._check_alerts(metrics)
    
    def _check_alerts(self, metrics: PerformanceMetrics):
        """检查告警条件"""
        alerts = []
        
        # 执行时间告警
        if metrics.execution_time > self.thresholds["max_execution_time"]:
            alerts.append({
                "type": "high_execution_time",
                "node": metrics.node_name,
                "value": metrics.execution_time,
                "threshold": self.thresholds["max_execution_time"],
                "timestamp": metrics.timestamp
            })
        
        # 检查错误率
        node_history = self.node_metrics[metrics.node_name]
        if len(node_history) >= 10:  # 至少10个样本才计算错误率
            recent_metrics = node_history[-10:]
            error_rate = sum(1 for m in recent_metrics if not m.success) / len(recent_metrics)
            
            if error_rate > self.thresholds["max_error_rate"]:
                alerts.append({
                    "type": "high_error_rate",
                    "node": metrics.node_name,
                    "value": error_rate,
                    "threshold": self.thresholds["max_error_rate"],
                    "timestamp": metrics.timestamp
                })
        
        # 添加告警
        for alert in alerts:
            self.alerts.append(alert)
            self._send_alert(alert)
    
    def _send_alert(self, alert: Dict[str, Any]):
        """发送告警"""
        print(f"🚨 ALERT: {alert['type']} - 节点: {alert['node']}, 值: {alert['value']}, 阈值: {alert['threshold']}")
    
    def get_node_statistics(self, node_name: str) -> Dict[str, Any]:
        """获取节点统计信息"""
        metrics = self.node_metrics[node_name]
        if not metrics:
            return {}
        
        execution_times = [m.execution_time for m in metrics]
        success_count = sum(1 for m in metrics if m.success)
        
        return {
            "total_calls": len(metrics),
            "success_rate": success_count / len(metrics) if metrics else 0,
            "avg_execution_time": sum(execution_times) / len(execution_times),
            "max_execution_time": max(execution_times),
            "last_updated": metrics[-1].timestamp
        }
    
    def generate_report(self) -> str:
        """生成性能报告"""
        report = ["\n" + "="*60]
        report.append("性能监控报告")
        report.append("="*60)
        
        # 总体统计
        total_metrics = len(self.metrics_history)
        if total_metrics > 0:
            success_count = sum(1 for m in self.metrics_history if m.success)
            overall_success_rate = success_count / total_metrics
            
            report.append(f"\n📊 总体统计:")
            report.append(f"  总调用次数: {total_metrics}")
            report.append(f"  总体成功率: {overall_success_rate:.2%}")
            report.append(f"  告警数量: {len(self.alerts)}")
        
        # 各节点统计
        report.append(f"\n🔍 节点性能统计:")
        for node_name in self.node_metrics:
            stats = self.get_node_statistics(node_name)
            if stats:
                report.append(f"\n  节点: {node_name}")
                report.append(f"    调用次数: {stats['total_calls']}")
                report.append(f"    成功率: {stats['success_rate']:.2%}")
                report.append(f"    平均执行时间: {stats['avg_execution_time']:.3f}秒")
                report.append(f"    最大执行时间: {stats['max_execution_time']:.3f}秒")
        
        report.append("\n" + "="*60)
        return "\n".join(report)

# 创建全局性能监控器
performance_monitor = PerformanceMonitor()

print("性能监控系统设置完成")
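
In `PerformanceMonitor`, `metrics_history` is capped by `max_history`, but the per-node lists in `node_metrics` grow without limit. As a minimal sketch of one way to bound them, the class below keeps a fixed-size window of latencies per node and reports a 95th percentile. `BoundedLatencyStore` and its methods are hypothetical names, not part of the original notebook.

```python
from collections import defaultdict, deque
from statistics import quantiles
from typing import Optional


class BoundedLatencyStore:
    """Keeps only the most recent `max_per_node` latency samples for each node."""

    def __init__(self, max_per_node: int = 100):
        self._samples = defaultdict(lambda: deque(maxlen=max_per_node))

    def record(self, node_name: str, seconds: float) -> None:
        self._samples[node_name].append(seconds)

    def count(self, node_name: str) -> int:
        # .get() avoids creating an empty entry for unknown nodes
        return len(self._samples.get(node_name, ()))

    def p95(self, node_name: str) -> Optional[float]:
        """95th-percentile latency, or None with fewer than two samples."""
        samples = self._samples.get(node_name)
        if not samples or len(samples) < 2:
            return None
        # n=20 yields 19 cut points; the last one is the 95th percentile
        return quantiles(samples, n=20, method="inclusive")[-1]


store = BoundedLatencyStore(max_per_node=3)
for latency in [0.1, 0.2, 0.3, 0.4, 0.5]:
    store.record("data_validation", latency)

print(store.count("data_validation"), store.p95("data_validation"))
```

Percentiles are often more informative than the average for alerting, because a few slow calls barely move the mean but show up clearly in the tail.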

### 3.3 Production Environment Test

In [None]:
# 生产环境状态管理
class ProductionState(TypedDict):
    task_id: str
    input_data: Dict[str, Any]
    processing_results: List[Dict[str, Any]]
    errors: List[Dict[str, Any]]

def monitored_node(node_name: str):
    """性能监控装饰器"""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(state):
            with performance_monitor.monitor_node(node_name):
                return func(state)
        return wrapper
    return decorator

@monitored_node("data_processing")
def robust_data_processing_node(state: ProductionState) -> ProductionState:
    """健壮的数据处理节点"""
    try:
        service = RobustService()
        
        # 处理数据
        result = service.call_external_api(state["input_data"])
        
        state["processing_results"].append({
            "node": "data_processing",
            "status": "success",
            "result": result
        })
        
        print(f"✅ data_processing 处理成功")
        
    except Exception as e:
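        # Record the error but keep going. Because the exception is swallowed here,
        # the monitoring wrapper records this call as successful.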
        error_info = {
            "node": "data_processing",
            "type": type(e).__name__,
            "message": str(e),
            "timestamp": datetime.now().isoformat()
        }
        state["errors"].append(error_info)
        print(f"❌ data_processing 处理失败: {str(e)}")
    
    return state

@monitored_node("data_validation")
def validation_node(state: ProductionState) -> ProductionState:
    """数据验证节点"""
    print(f"执行数据验证: {state['task_id']}")
    
    # 模拟验证逻辑
    time.sleep(0.1)
    
    # 模拟随机错误
    if random.random() < 0.1:  # 10%错误率
        raise ValueError("数据验证失败")
    
    state["processing_results"].append({
        "node": "data_validation",
        "result": "验证通过",
        "status": "success"
    })
    
    return state

# 创建生产环境测试工作流
production_workflow = StateGraph(ProductionState)
production_workflow.add_node("validation", validation_node)
production_workflow.add_node("processing", robust_data_processing_node)

production_workflow.add_edge(START, "validation")
production_workflow.add_edge("validation", "processing")
production_workflow.add_edge("processing", END)

production_app = production_workflow.compile()

def run_production_test(num_tasks: int = 10):
    """运行生产环境测试"""
    print(f"开始生产环境测试，执行 {num_tasks} 个任务...\n")
    
    successful_tasks = 0
    failed_tasks = 0
    
    for i in range(num_tasks):
        task_id = f"TASK_{i+1:03d}"
        
        # 创建初始状态
        initial_state = {
            "task_id": task_id,
            "input_data": {"data": f"测试数据_{i+1}"},
            "processing_results": [],
            "errors": []
        }
        
        try:
            print(f"执行任务 {task_id}...", end=" ")
            
            # 执行工作流
            result = production_app.invoke(initial_state)
            
            if result["errors"]:
                failed_tasks += 1
                print(f"❌ 失败 ({len(result['errors'])} 个错误)")
            else:
                successful_tasks += 1
                print(f"✅ 成功")
                
        except Exception as e:
            failed_tasks += 1
            print(f"❌ 异常: {str(e)}")
        
        # 短暂延迟模拟实际负载
        time.sleep(0.1)
    
    print(f"\n测试完成！")
    print(f"成功任务: {successful_tasks}/{num_tasks} ({successful_tasks/num_tasks:.1%})")
    print(f"失败任务: {failed_tasks}/{num_tasks} ({failed_tasks/num_tasks:.1%})")
    
    # 生成性能报告
    print(performance_monitor.generate_report())

# 运行生产环境测试
run_production_test(15)

## 4. Best Practices Summary

### 4.1 Architecture Design Principles

In [None]:
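# Best practices summary
best_practices = {
    "Architecture": {
        "Modular design": "Decompose complex systems into independent, reusable components",
        "Single responsibility": "Give each node one clearly defined responsibility",
        "Loose coupling": "Decouple components through state passing and an event system",
        "Scalability": "Design for both horizontal and vertical scaling"
    },
    "Error handling": {
        "Graceful degradation": "Keep the system running when some features fail",
        "Retry mechanism": "Retry transient errors with exponential backoff",
        "Error classification": "Distinguish retryable from non-retryable errors",
        "Error logging": "Record detailed error information for debugging and analysis"
    },
    "Performance optimization": {
        "Asynchronous processing": "Use asynchronous processing for I/O-bound tasks",
        "Parallel execution": "Use parallelism to increase throughput",
        "Resource management": "Manage memory and CPU usage carefully",
        "Caching strategy": "Cache results to avoid repeated computation"
    },
    "Monitoring and operations": {
        "Metrics collection": "Collect key performance and business metrics",
        "Alerting": "Set sensible alert thresholds and notification channels",
        "Log management": "Use structured logging and log analysis",
        "Health checks": "Implement health checks and automatic recovery"
    }
}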

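def print_best_practices():
    """Print the best practices guide."""
    print("\n" + "="*70)
    print("LangGraph Production Best Practices Guide")
    print("="*70)
    
    for category, practices in best_practices.items():
        print(f"\n🏗️ {category}:")
        for principle, description in practices.items():
            print(f"  • {principle}: {description}")
    
    print("\n" + "="*70)
    print("Suggested Use Cases")
    print("="*70)
    
    scenarios = {
        "RAG systems": "Knowledge-intensive applications such as intelligent customer service, document Q&A, and expert systems",
        "Multi-agent collaboration": "Complex business processes such as order processing, approval workflows, and multi-step analysis",
        "Real-time processing": "Scenarios that need immediate responses, such as monitoring, real-time recommendation, and online diagnostics",
        "Batch workflows": "Large-scale data processing such as ETL pipelines, report generation, and data synchronization"
    }
    
    for scenario, description in scenarios.items():
        print(f"\n📋 {scenario}:")
        print(f"  {description}")
    
    print("\n" + "="*70)
    print("Deployment Recommendations")
    print("="*70)
    
    deployment_tips = [
        "Deploy with container technology (Docker/Kubernetes)",
        "Use blue-green deployments or rolling updates",
        "Configure load balancing and autoscaling",
        "Build a complete CI/CD pipeline",
        "Implement comprehensive monitoring and log management",
        "Run performance and stress tests regularly",
        "Set up disaster recovery and data restoration procedures",
        "Maintain detailed operations documentation and incident runbooks"
    ]
    
    for tip in deployment_tips:
        print(f"  ✓ {tip}")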

print_best_practices()
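
The guide recommends asynchronous processing for I/O-bound tasks, although every node in this notebook runs synchronously. As a minimal sketch of that recommendation, the code below fetches several items concurrently while capping the number of requests in flight. `fetch_document` and `fetch_all` are hypothetical names, and `asyncio.sleep` stands in for a real network or database call.

```python
import asyncio
from typing import List


async def fetch_document(doc_id: int) -> str:
    """Hypothetical I/O-bound call; the sleep simulates network latency."""
    await asyncio.sleep(0.01)
    return f"doc-{doc_id}"


async def fetch_all(doc_ids: List[int], max_concurrency: int = 3) -> List[str]:
    """Fetch all documents concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(doc_id: int) -> str:
        async with semaphore:
            return await fetch_document(doc_id)

    # gather returns results in the same order as the awaitables passed in
    return await asyncio.gather(*(bounded_fetch(doc_id) for doc_id in doc_ids))


results = asyncio.run(fetch_all([1, 2, 3, 4, 5]))
print(results)
```

Inside a Jupyter notebook an event loop is already running, so `asyncio.run` raises an error there; use `results = await fetch_all([1, 2, 3, 4, 5])` in a cell instead.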

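## 5. Practice Exercises

### Exercise 1: Build an Intelligent Recommendation System
Build a multi-stage recommendation system based on user behavior and content features:
- User profile analysis
- Candidate item recall
- Feature engineering and ranking
- Diversity and novelty optimization
- A/B testing and evaluation

### Exercise 2: Implement a Distributed Data Processing Pipeline
Design a scalable data processing pipeline:
- Data ingestion and validation
- Parallel data cleaning and transformation
- Feature extraction and aggregation
- Data quality checks
- Result storage and notification

### Exercise 3: Create an Intelligent Operations System
Build an automated operations system:
- System monitoring and alerting
- Anomaly detection and analysis
- Automated fault diagnosis
- Remediation suggestions and execution
- Performance optimization suggestions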

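## 6. Summary

This tutorial covered advanced LangGraph application cases and production best practices.

### Review of Main Topics

1. **Building a RAG system**
   - Intelligent document retrieval and question answering
   - Vector database integration
   - Context management and answer generation
   - Quality assessment and validation

2. **Multi-agent collaboration system**
   - Intelligent customer service system design
   - Request classification and routing
   - Agent specialization and coordination
   - Quality assurance and escalation

3. **Production deployment**
   - Error handling and retry mechanisms
   - Performance monitoring and metrics collection
   - Alerting and automated operations
   - Scalability and reliability design

### Core Value

- **Scalability**: modular design supports both horizontal and vertical scaling
- **Reliability**: error handling, retries, and monitoring keep the system running stably
- **Maintainability**: a clear architecture with thorough logging and monitoring lowers operating costs
- **Performance**: asynchronous processing, parallel execution, and caching improve throughput
- **User experience**: intelligent processing and real-time feedback improve user satisfaction

### Future Directions

As AI technology develops, LangGraph is expected to keep evolving in the following areas:
- Stronger multimodal processing
- Smarter automated orchestration
- Better cloud-native support
- Richer ecosystem integrations

After working through this tutorial, you have seen the core techniques for building enterprise-grade AI applications with LangGraph. Keep practicing and experimenting to build smarter and more efficient systems.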