# Module 3: LangChain

## Applied AI Scientist Field Notes - Expanded Edition

---


## Module 3: LangChain - Production Chains, Agents, and Evaluation

### Topics
1. Chain design patterns
2. Error handling and retries
3. Memory management
4. Tool use and function calling
5. Agent architectures
6. Evaluation frameworks

---

In [None]:
%pip install -q langchain langchain-community langchain-openai
%pip install -q pydantic tiktoken

print('LangChain installed!')

### Section 1: Robust Chain Design

Production chains need:
- Structured output parsing
- Automatic retries
- Validation
- Error recovery
- Metrics tracking

In [None]:
from typing import Callable, Any
from pydantic import BaseModel, Field
import time

class RobustChain:
    '''Production chain with retry and observability'''
    
    def __init__(self, llm_func: Callable, parser: Callable, max_retries=3):
        self.llm_func = llm_func
        self.parser = parser
        self.max_retries = max_retries
        self.metrics = {'calls': 0, 'retries': 0, 'failures': 0, 'total_latency': 0}
    
    def invoke(self, prompt: str) -> Any:
        '''Execute with retry logic'''
        self.metrics['calls'] += 1
        start = time.time()
        
        for attempt in range(self.max_retries):
            try:
                response = self.llm_func(prompt)
                result = self.parser(response)
                
                self.metrics['total_latency'] += (time.time() - start)
                return result
                
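            except Exception as e:
                if attempt == self.max_retries - 1:
                    # Final attempt failed: record the failure and the time spent
                    self.metrics['failures'] += 1
                    self.metrics['total_latency'] += (time.time() - start)
                    raise RuntimeError(f'Chain failed after {self.max_retries} attempts: {e}') from e
                
                # Count only attempts that are actually retried
                self.metrics['retries'] += 1
                prompt += f'\n\n[Error in previous attempt: {e}. Fix the output.]'
                time.sleep(0.5)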
    
    def get_metrics(self):
        return {
            **self.metrics,
            'success_rate': 1 - (self.metrics['failures'] / self.metrics['calls']) if self.metrics['calls'] > 0 else 0,
            'avg_latency': self.metrics['total_latency'] / self.metrics['calls'] if self.metrics['calls'] > 0 else 0
        }

# Example
class SentimentOutput(BaseModel):
    sentiment: str = Field(description='positive, negative, or neutral')
    confidence: float = Field(ge=0.0, le=1.0)

def mock_llm(prompt):
    return '{"sentiment": "positive", "confidence": 0.85}'

def parser(response):
    import json
    data = json.loads(response)
    return SentimentOutput(**data)

chain = RobustChain(mock_llm, parser)
result = chain.invoke('This product is great!')
print(f'Result: {result}')
print(f'Metrics: {chain.get_metrics()}')
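
The fixed 0.5 s pause in `RobustChain` is fine for a demo, but against a rate-limited API a growing, randomized delay is a common alternative. The sketch below is one way to do that, not part of LangChain: `backoff_delays` and `call_with_backoff` are hypothetical helper names, and the `sleep` parameter exists only so the example can run without waiting.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar('T')

def backoff_delays(max_retries: int, base: float = 0.5, cap: float = 8.0) -> List[float]:
    '''Upper bound of the delay before each retry: base * 2**attempt, capped at cap seconds'''
    return [min(cap, base * (2 ** attempt)) for attempt in range(max_retries)]

def call_with_backoff(func: Callable[[], T], max_retries: int = 3,
                      base: float = 0.5, cap: float = 8.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    '''Call func, retrying on any exception with jittered exponential backoff'''
    delays = backoff_delays(max_retries, base, cap)
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Full jitter: wait a random amount up to the scheduled delay
            sleep(random.uniform(0, delays[attempt]))
    raise AssertionError('unreachable')

# Demo: a function that fails twice, then succeeds
attempts = {'n': 0}

def flaky() -> str:
    attempts['n'] += 1
    if attempts['n'] < 3:
        raise ValueError('transient failure')
    return 'ok'

slept: List[float] = []
result = call_with_backoff(flaky, max_retries=3, sleep=slept.append)
print(result, attempts['n'], len(slept))  # ok 3 2
```

Passing `slept.append` as `sleep` records the delays instead of waiting, which also makes the retry schedule easy to assert on in tests.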

### Section 2: Memory Management for Stateful Conversations

Production conversational systems require efficient memory management to maintain context across turns while controlling costs.


In [None]:
from collections import deque
from typing import List
import json

class ConversationMemory:
    '''Production conversation memory with token budgeting'''
    
    def __init__(self, max_tokens: int = 2000, summarization_threshold: int = 1500):
        self.max_tokens = max_tokens
        self.summarization_threshold = summarization_threshold
        self.messages = deque()
        self.summary = None
        self.token_count = 0
    
    def add_message(self, role: str, content: str):
        '''Add message with automatic summarization'''
        import tiktoken
        enc = tiktoken.encoding_for_model('gpt-4')
        
        msg_tokens = len(enc.encode(content))
        
        # Check if we need to summarize
        if self.token_count + msg_tokens > self.summarization_threshold:
            self._summarize_old_messages()
        
        self.messages.append({'role': role, 'content': content, 'tokens': msg_tokens})
        self.token_count += msg_tokens
        
        # Enforce hard limit
        while self.token_count > self.max_tokens and len(self.messages) > 1:
            removed = self.messages.popleft()
            self.token_count -= removed['tokens']
    
    def _summarize_old_messages(self):
        '''Summarize older messages to save tokens'''
        # Keep last 3 messages, summarize the rest
        if len(self.messages) <= 3:
            return
        
        to_summarize = list(self.messages)[:-3]
        conversation_text = '\n'.join([
            f"{msg['role']}: {msg['content']}" for msg in to_summarize
        ])
        
        # Mock LLM call for summarization (a real call would take conversation_text as input)
        summary = f"Summary: Discussed {len(to_summarize)} topics including..."
        
        # Clear old messages, keep summary
        for _ in range(len(to_summarize)):
            removed = self.messages.popleft()
            self.token_count -= removed['tokens']
        
        # Charge the summary's token estimate once; later summaries replace the earlier one
        if self.summary is None:
            self.token_count += 100  # Approximate summary tokens
        self.summary = summary
    
    def get_context(self) -> List[dict]:
        '''Get conversation context for LLM'''
        context = []
        
        if self.summary:
            context.append({'role': 'system', 'content': self.summary})
        
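        # Send only role/content; the 'tokens' field is internal bookkeeping
        context.extend({'role': m['role'], 'content': m['content']} for m in self.messages)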
        return context
    
    def get_metrics(self) -> dict:
        '''Memory usage metrics'''
        return {
            'total_messages': len(self.messages),
            'token_count': self.token_count,
            'token_utilization': self.token_count / self.max_tokens,
            'has_summary': self.summary is not None
        }

# Demo
memory = ConversationMemory(max_tokens=500, summarization_threshold=300)

conversation = [
    ('user', 'Tell me about Python'),
    ('assistant', 'Python is a high-level programming language known for its simplicity and readability.'),
    ('user', 'What about its history?'),
    ('assistant', 'Python was created by Guido van Rossum and first released in 1991.'),
    ('user', 'What are its main uses?'),
    ('assistant', 'Python is widely used in web development, data science, machine learning, automation, and more.'),
]

for role, content in conversation:
    memory.add_message(role, content)
    metrics = memory.get_metrics()
    print(f"Added {role} message | Tokens: {metrics['token_count']}/{memory.max_tokens} | Messages: {metrics['total_messages']}")

print(f"\nFinal context: {len(memory.get_context())} messages")
print(f"Token utilization: {memory.get_metrics()['token_utilization']:.1%}")
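
The hard-limit loop in `add_message` is a sliding window: it drops the oldest messages until the total fits the budget. The same idea can be sketched without `tiktoken`, assuming a rough four-characters-per-token estimate for English text. `approx_tokens` and `trim_to_budget` are hypothetical helpers for illustration, and the estimate is not accurate enough for billing.

```python
from collections import deque
from typing import Callable, List, Tuple

def approx_tokens(text: str) -> int:
    '''Rough estimate: about 4 characters per token for English text'''
    return max(1, len(text) // 4)

def trim_to_budget(messages: List[dict], max_tokens: int,
                   count: Callable[[str], int] = approx_tokens) -> Tuple[List[dict], int]:
    '''Drop the oldest messages until the total fits, always keeping the newest one'''
    window = deque(messages)
    total = sum(count(m['content']) for m in window)
    while total > max_tokens and len(window) > 1:
        total -= count(window.popleft()['content'])
    return list(window), total

history = [{'role': 'user', 'content': 'x' * 40} for _ in range(5)]  # 10 tokens each
kept, used = trim_to_budget(history, max_tokens=25)
print(len(kept), used)  # 2 20
```

Taking the counter as a parameter means an exact tokenizer can be swapped in later without touching the eviction logic.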


## Interview Questions: LangChain & Production Chains

### For Experienced Professionals

Production chain systems require robust error handling, efficient memory management, and comprehensive evaluation.


In [None]:
interview_questions_langchain = [
    {
        "level": "Senior",
        "question": "Your LangChain agent makes 5-10 tool calls per query, costing $0.15-$0.30. Business wants to reduce cost by 50% without sacrificing quality. Walk through your optimization approach.",
        "answer": """
**Current Cost Breakdown:**
- Base query: $0.03 (input) + $0.06 (output) = $0.09
- 5-10 tool calls: Each call adds $0.01-$0.03
- Total: $0.15-$0.30 per query

**Target: $0.075-$0.15 (50% reduction)**

**Root Cause Analysis:**

1. **Profile Tool Usage:**
```python
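from collections import defaultdict
from typing import List

def profile_agent_execution(agent, queries: List[str]):
    # Assumes agent.run_with_trace() returns a dict with a 'tool_calls' list
    tool_usage = defaultdict(int)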
    redundant_calls = 0
    
    for query in queries:
        trace = agent.run_with_trace(query)
        
        # Track which tools are called
        for call in trace['tool_calls']:
            tool_usage[call['tool']] += 1
            
            # Check for redundant calls (same tool, similar args)
            if is_redundant(call, trace['tool_calls']):
                redundant_calls += 1
    
    total_calls = sum(tool_usage.values())
    return {
        'tool_usage': dict(tool_usage),
        'avg_calls_per_query': total_calls / len(queries) if queries else 0,
        'redundant_call_rate': redundant_calls / total_calls if total_calls else 0
    }

# Example output:
# {
#   'tool_usage': {'search': 120, 'calculator': 80, 'database': 50},
#   'avg_calls_per_query': 7.5,
#   'redundant_call_rate': 0.15  # 15% of calls are redundant
# }
```
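
The profiler relies on an `is_redundant` helper that is not defined above. A minimal sketch, assuming "redundant" means an earlier call in the same trace used the same tool with identical arguments (exact match after key sorting, not the fuzzier "similar args" check a production system might want):

```python
import json
from typing import List

def call_signature(call: dict) -> str:
    '''Canonical key: tool name plus arguments serialized with sorted keys'''
    return call['tool'] + ':' + json.dumps(call.get('args', {}), sort_keys=True)

def is_redundant(call: dict, all_calls: List[dict]) -> bool:
    '''True if an earlier call in the trace has the same tool and arguments'''
    target = call_signature(call)
    for earlier in all_calls:
        if earlier is call:
            return False  # reached the call itself without finding a match
        if call_signature(earlier) == target:
            return True
    return False

trace_calls = [
    {'tool': 'search', 'args': {'query': 'Q4 revenue'}},
    {'tool': 'calculator', 'args': {'expr': '1 + 1'}},
    {'tool': 'search', 'args': {'query': 'Q4 revenue'}},
]
flags = [is_redundant(c, trace_calls) for c in trace_calls]
print(flags)  # [False, False, True]
```

Only the repeat is flagged, not the first occurrence, so the redundancy rate counts avoidable calls rather than every member of a duplicate group.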

**Optimization Strategy:**

**1. Caching Tool Results (Immediate, ~25% cost reduction):**
```python
import hashlib
import json
from typing import Any

class CachedToolExecutor:
    def __init__(self):
        self.cache = {}  # In production: Redis with TTL
        self.cache_hits = 0
        self.cache_misses = 0
    
    def execute_tool(self, tool_name: str, args: dict) -> Any:
        # Create cache key from tool + args
        cache_key = self._get_cache_key(tool_name, args)
        
        if cache_key in self.cache:
            self.cache_hits += 1
            return self.cache[cache_key]
        
        # Execute tool
        result = self._run_tool(tool_name, args)
        
        # Cache result
        self.cache[cache_key] = result
        self.cache_misses += 1
        
        return result
    
    def _get_cache_key(self, tool_name: str, args: dict) -> str:
        # Deterministic key from tool + args
        key_str = f"{tool_name}:{json.dumps(args, sort_keys=True)}"
        return hashlib.md5(key_str.encode()).hexdigest()
    
    def get_metrics(self):
        total = self.cache_hits + self.cache_misses
        return {
            'cache_hit_rate': self.cache_hits / total if total > 0 else 0,
            'cost_saved_pct': (self.cache_hits / total) * 100 if total > 0 else 0
        }

# Expected: 20-30% cache hit rate → 20-30% cost reduction on tool calls
```
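
The executor above caches forever, which is risky for tools whose results go stale (search, database reads). The comment mentions Redis with a TTL; the same expiry behavior can be sketched in memory. `TTLCache` is a hypothetical class for illustration, and the injectable `clock` is an assumption made so the expiry can be tested without sleeping.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

class TTLCache:
    '''In-memory cache whose entries expire ttl_seconds after being set'''

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # stale: evict and report a miss
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (self.clock() + self.ttl, value)

# Demo with a fake clock so the expiry is visible immediately
now = [0.0]
cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])
cache.set('search:q4-revenue', 'cached result')
fresh = cache.get('search:q4-revenue')
now[0] = 61.0
stale = cache.get('search:q4-revenue')
print(fresh, stale)  # cached result None
```

One limitation of this sketch: a tool that legitimately returns `None` is indistinguishable from a miss, so a sentinel value would be needed in that case.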

**2. Plan-and-Execute Pattern (Medium-term, ~40% cost reduction):**
```python
class PlanAndExecuteAgent:
    '''Generate plan first, execute efficiently'''
    
    def run(self, query: str) -> str:
        # Step 1: Generate execution plan (1 LLM call; low temperature keeps the plan stable)
        plan = self.generate_plan(query)
        # Example plan: 
        # 1. Search for "Q4 revenue"
        # 2. Extract number from results
        # 3. Calculate YoY growth
        
        # Step 2: Execute plan steps (fewer, more targeted tool calls)
        results = {}
        for step in plan['steps']:
            tool = step['tool']
            args = step['args']
            results[step['id']] = self.execute_tool(tool, args)
        
        # Step 3: Synthesize final answer (1 LLM call)
        answer = self.synthesize_answer(query, plan, results)
        
        return answer
    
    def generate_plan(self, query: str) -> dict:
        '''Generate execution plan with low temperature'''
        prompt = f'''
        Query: {query}
        
        Generate a step-by-step plan to answer this query.
        Available tools: search, calculator, database_query
        
        Return JSON:
        {{
            "steps": [
                {{"id": 1, "action": "description", "tool": "tool_name", "args": {{...}} }}
            ]
        }}
        '''
        
        # Use low temperature for deterministic planning
        response = llm.generate(prompt, temperature=0.2)
        return json.loads(response)
    
    def synthesize_answer(self, query: str, plan: dict, results: dict) -> str:
        '''Combine results into final answer'''
        prompt = f'''
        Query: {query}
        
        Execution results:
        {json.dumps(results, indent=2)}
        
        Synthesize a clear answer to the query.
        '''
        
        return llm.generate(prompt, temperature=0.5)

# Comparison:
# ReAct agent (traditional):
#   - Query → Tool call → Observation → Think → Tool call → ...
#   - 5-10 LLM calls, 5-10 tool calls
#   - Cost: $0.15-$0.30
#
# Plan-and-Execute:
#   - Query → Plan (1 LLM call) → Execute tools (3-5 calls) → Synthesize (1 LLM call)
#   - 2 LLM calls, 3-5 tool calls
#   - Cost: $0.09-$0.18 (60% of original)
```
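
Because the plan is parsed straight from model output, it is worth checking before any tool runs. The sketch below validates the JSON shape requested in `generate_plan`; `validate_plan` and `REQUIRED_STEP_KEYS` are hypothetical names, and the tool list mirrors the one in the planning prompt.

```python
import json
from typing import List, Set

REQUIRED_STEP_KEYS = {'id', 'tool', 'args'}

def validate_plan(raw: str, allowed_tools: Set[str]) -> List[dict]:
    '''Parse planner output; raise ValueError on malformed or unknown-tool steps'''
    plan = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    steps = plan.get('steps') if isinstance(plan, dict) else None
    if not isinstance(steps, list) or not steps:
        raise ValueError('plan must contain a non-empty steps list')
    for step in steps:
        if not isinstance(step, dict):
            raise ValueError('each step must be a JSON object')
        missing = REQUIRED_STEP_KEYS - set(step)
        if missing:
            raise ValueError(f'step is missing keys: {sorted(missing)}')
        tool = step['tool']
        if tool not in allowed_tools:
            raise ValueError(f'unknown tool: {tool}')
        if not isinstance(step['args'], dict):
            raise ValueError(f'args for tool {tool} must be a JSON object')
    return steps

allowed = {'search', 'calculator', 'database_query'}
good_plan = '{"steps": [{"id": 1, "action": "find revenue", "tool": "search", "args": {"query": "Q4 revenue"}}]}'
steps = validate_plan(good_plan, allowed)
print(steps[0]['tool'])  # search
```

A failed validation is a natural point to re-prompt the planner with the error message instead of executing a partial plan.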

**3. Tool Fusion (Long-term, ~50% cost reduction):**
```python
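from collections import defaultdict
from typing import Any, Dict, List

class FusedToolExecutor:
    '''Combine multiple tool calls into single batch operation'''
    
    def batch_execute(self, tool_calls: List[dict]) -> Dict[str, Any]:
        '''Group tool calls by type and run each group as one batch; returns results keyed by call id'''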
        # Group by tool type
        by_tool = defaultdict(list)
        for call in tool_calls:
            by_tool[call['tool']].append(call)
        
        # Execute in batches
        results = {}
        for tool_name, calls in by_tool.items():
            if tool_name == 'search':
                # Batch search: 1 API call for multiple queries
                queries = [c['args']['query'] for c in calls]
                batch_results = self.search_batch(queries)
                for call, result in zip(calls, batch_results):
                    results[call['id']] = result
            
            elif tool_name == 'database_query':
                # Batch DB queries with UNION
                batch_results = self.database_batch([c['args'] for c in calls])
                for call, result in zip(calls, batch_results):
                    results[call['id']] = result
        
        return results
    
    def search_batch(self, queries: List[str]) -> List[str]:
        '''Single API call for multiple search queries'''
        # Many search APIs support batch requests
        return search_api.batch_search(queries)

# Cost comparison:
# Sequential: 5 search calls × $0.01 = $0.05
# Batched: 1 batch call = $0.02 (60% savings)
```

**4. Model Routing (Immediate, ~30% cost reduction):**
```python
class SmartModelRouter:
    '''Route to cheaper models when appropriate'''
    
    def route_query(self, query: str, task_type: str) -> str:
        # Simple queries → cheap model (GPT-3.5, $0.002/1K tokens)
        # Complex queries → expensive model (GPT-4, $0.03/1K tokens)
        
        complexity = self.estimate_complexity(query)
        
        if complexity == 'simple' or task_type in ['extraction', 'classification']:
            model = 'gpt-3.5-turbo'  # 15x cheaper
        else:
            model = 'gpt-4'
        
        return llm.generate(query, model=model)
    
    def estimate_complexity(self, query: str) -> str:
        '''Estimate query complexity'''
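        # Heuristics: check reasoning keywords first, so a short
        # 'compare X and Y' query is not misclassified as simple
        if any(keyword in query.lower() for keyword in ['analyze', 'compare', 'complex']):
            return 'complex'
        elif len(query.split()) < 50:
            return 'simple'
        else:
            return 'medium'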

# Expected: 40-50% of queries can use cheaper model → 30% cost reduction
```
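
The link between the routed fraction and the saving can be checked with a quick calculation. This is a sketch using only the per-1K-token prices quoted in the code comments above; `blended_cost` is a hypothetical helper, and the result covers model token spend only. Tool-call costs are unaffected by routing, so the per-query saving will be lower than the figures printed here.

```python
def blended_cost(fraction_cheap: float, cheap_price: float, expensive_price: float) -> float:
    '''Average price per 1K tokens when a fraction of traffic uses the cheap model'''
    return fraction_cheap * cheap_price + (1 - fraction_cheap) * expensive_price

CHEAP, EXPENSIVE = 0.002, 0.03  # $/1K tokens, as quoted above

for fraction in (0.4, 0.5):
    cost = blended_cost(fraction, CHEAP, EXPENSIVE)
    saving = 1 - cost / EXPENSIVE
    print(f'{fraction:.0%} routed cheap: ${cost:.4f}/1K tokens, {saving:.0%} saving on token spend')
```

Routing 40-50% of traffic gives a 37-47% saving on token spend, which is consistent with the more conservative ~30% estimate once the fixed tool-call costs are included.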

**5. Monitoring & Continuous Optimization:**
```python
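from collections import defaultdict
from datetime import datetime

class AgentCostMonitor:
    def __init__(self):
        self.metrics = defaultdict(lambda: {
            'calls': 0,
            'total_cost': 0,
            'avg_tool_calls': []
        })
        self.query_traces = {}  # query -> latest trace plus cost, read by analyze_trends
    
    def log_execution(self, query: str, trace: dict, cost: float):
        day_key = datetime.now().strftime('%Y-%m-%d')
        self.metrics[day_key]['calls'] += 1
        self.metrics[day_key]['total_cost'] += cost
        self.metrics[day_key]['avg_tool_calls'].append(len(trace['tool_calls']))
        self.query_traces[query] = {**trace, 'cost': cost}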
    
    def analyze_trends(self):
        '''Identify cost optimization opportunities'''
        # Find queries with excessive tool calls
        expensive_patterns = []
        
        for query, trace in self.query_traces.items():
            if len(trace['tool_calls']) > 8:
                expensive_patterns.append({
                    'query': query,
                    'tool_calls': len(trace['tool_calls']),
                    'cost': trace['cost']
                })
        
        return expensive_patterns

# Weekly review:
# - Identify queries with >8 tool calls (the threshold used in analyze_trends)
# - Check for redundant tool usage
# - Optimize high-frequency expensive patterns
```

**Complete Optimization Results:**

| Technique | Cost Reduction | Implementation Time | Complexity |
|-----------|----------------|---------------------|------------|
| Tool caching | 20-25% | 1-2 days | Low |
| Model routing | 30% | 1 week | Medium |
| Plan-and-Execute | 40% | 2-3 weeks | High |
| Tool fusion | 50% | 3-4 weeks | High |

**Recommended Approach:**
1. Week 1: Implement caching + model routing (roughly 45-50% combined reduction, minimal risk)
2. Week 2-3: A/B test Plan-and-Execute on 20% traffic
3. Week 4: Gradual rollout if quality maintained
4. Result: $0.15 → $0.08 per query (47% reduction)
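
The 47% figure follows if the savings compound rather than add. This is a sketch that assumes the two techniques are independent and that each applies to whatever cost the previous one left; `combined_reduction` is a hypothetical helper.

```python
def combined_reduction(*reductions: float) -> float:
    '''Overall reduction when each technique applies to the cost left by the previous one'''
    remaining = 1.0
    for r in reductions:
        remaining *= (1 - r)
    return 1 - remaining

overall = combined_reduction(0.25, 0.30)  # caching, then model routing
new_cost = 0.15 * (1 - overall)
print(f'{overall:.1%}')    # 47.5%
print(f'${new_cost:.3f}')  # $0.079
```

Adding the percentages (25% + 30% = 55%) would overstate the saving, because routing can only reduce the cost of calls that caching did not already eliminate.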

**Key Metrics to Monitor:**
- Cost per query (target: $0.08)
- Quality (BLEU score, user satisfaction)
- Latency (ensure caching doesn't add lag)
- Tool call distribution (watch for regressions)
        """,
    },
    {
        "level": "Senior",
        "question": "Your conversational assistant stores conversation history in memory. After 20 turns, context windows are full and the system starts dropping early messages, causing 'memory loss.' Design a production solution.",
        "answer": """
**Problem Analysis:**

**Current Situation:**
- 20-turn conversation: ~4000-6000 tokens
- Context limit: 8K tokens (leaving 2-4K for system prompt + output)
- System dropping first 10 turns → loses important context
- User complaints: "You forgot what we discussed earlier"

**Requirements:**
- Maintain conversation coherence across 50+ turns
- Keep memory cost reasonable (<500 tokens overhead)
- Preserve important context even from early turns
- Handle multiple concurrent users

**Production Solution: Hierarchical Memory System**

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Dict, Optional
import json

@dataclass
class MemoryEntry:
    turn: int
    role: str
    content: str
    tokens: int
    importance: float  # 0.0 to 1.0
    timestamp: str
    summary: Optional[str] = None

class HierarchicalMemory:
    '''Production conversation memory with intelligent compression'''
    
    def __init__(self, max_tokens: int = 2000):
        self.max_tokens = max_tokens
        self.short_term = []  # Recent messages (full text)
        self.long_term = []   # Older messages (summarized)
        self.facts = {}       # Extracted facts/entities
        self.current_tokens = 0
    
    def add_message(self, role: str, content: str, turn: int) -> None:
        '''Add message with automatic compression'''
        tokens = self.count_tokens(content)
        importance = self.calculate_importance(content, role)
        
        entry = MemoryEntry(
            turn=turn,
            role=role,
            content=content,
            tokens=tokens,
            importance=importance,
            timestamp=datetime.utcnow().isoformat()
        )
        
        # Add to short-term memory
        self.short_term.append(entry)
        self.current_tokens += tokens
        
        # Extract facts for long-term storage
        if importance > 0.7:  # Important information
            self.extract_facts(content, turn)
        
        # Compress if needed
        if self.current_tokens > self.max_tokens * 0.8:  # 80% threshold
            self.compress_memory()
    
    def calculate_importance(self, content: str, role: str) -> float:
        '''Estimate message importance'''
        score = 0.5  # Base score
        
        # User messages are generally more important
        if role == 'user':
            score += 0.2
        
        # Check for important keywords
        important_keywords = ['important', 'remember', 'note', 'key', 'critical']
        if any(kw in content.lower() for kw in important_keywords):
            score += 0.2
        
        # Questions are important
        if '?' in content:
            score += 0.1
        
        # Long messages might contain important details
        if len(content.split()) > 50:
            score += 0.1
        
        return min(1.0, score)
    
    def compress_memory(self) -> None:
        '''Move old messages to long-term memory with summarization'''
        # Keep last 5 messages in short-term (most recent context)
        if len(self.short_term) <= 5:
            return
        
        # Move older messages to long-term
        to_compress = self.short_term[:-5]
        self.short_term = self.short_term[-5:]
        
        # Summarize the compressed messages
        summary = self.summarize_messages(to_compress)
        
        # Store summary in long-term memory
        turns = [m.turn for m in to_compress]
        self.long_term.append({
            'turns': turns,
            'summary': summary,
            'tokens': self.count_tokens(summary),
            'original_tokens': sum(m.tokens for m in to_compress),
            'important_facts': [f for f in self.facts.values() if f['turn'] in turns]
        })
        
        # Recalculate token count
        self.current_tokens = sum(m.tokens for m in self.short_term)
        self.current_tokens += sum(lt['tokens'] for lt in self.long_term)
    
    def summarize_messages(self, messages: List[MemoryEntry]) -> str:
        '''Summarize a group of messages'''
        # Build conversation text
        conv_text = '\\n'.join([
            f"{m.role}: {m.content}" for m in messages
        ])
        
        # Use LLM to summarize (with specific prompt)
        prompt = f'''
        Summarize this conversation segment, preserving key facts and context:
        
        {conv_text}
        
        Format as: "In turns {messages[0].turn}-{messages[-1].turn}, the user discussed..."
        Keep it concise (< 100 words) but preserve important details.
        '''
        
        summary = llm.generate(prompt, temperature=0.3, max_tokens=150)
        return summary
    
    def extract_facts(self, content: str, turn: int) -> None:
        '''Extract key facts for long-term storage'''
        # Use LLM to extract structured facts
        prompt = f'''
        Extract key facts from this message:
        "{content}"
        
        Return JSON: {{"facts": [{{"type": "preference/name/date/etc", "value": "..."}}]}}
        '''
        
        try:
            response = llm.generate(prompt, temperature=0.0)
            facts = json.loads(response)['facts']
            
            for fact in facts:
                fact_key = f"{fact['type']}:{fact['value']}"
                self.facts[fact_key] = {
                    'value': fact['value'],
                    'type': fact['type'],
                    'turn': turn,
                    'last_mentioned': turn
                }
        except Exception:
            pass  # Graceful degradation if extraction fails (bad JSON, missing keys, LLM error)
    
    def get_context_for_llm(self) -> str:
        '''Build optimized context for LLM'''
        context_parts = []
        
        # 1. Persistent facts (always included)
        if self.facts:
            facts_text = self._format_facts()
            context_parts.append(f"[Known Facts]\\n{facts_text}")
        
        # 2. Long-term memory summaries
        for lt_entry in self.long_term:
            context_parts.append(f"[History] {lt_entry['summary']}")
        
        # 3. Recent conversation (full text)
        for msg in self.short_term:
            context_parts.append(f"{msg.role}: {msg.content}")
        
        return '\\n\\n'.join(context_parts)
    
    def _format_facts(self) -> str:
        '''Format extracted facts for context'''
        fact_lines = []
        for fact_key, fact_data in self.facts.items():
            fact_lines.append(f"- {fact_data['type'].title()}: {fact_data['value']}")
        return '\\n'.join(fact_lines[:10])  # Limit to the first 10 facts (insertion order)
    
    def get_metrics(self) -> dict:
        '''Memory system metrics'''
        return {
            'short_term_messages': len(self.short_term),
            'long_term_segments': len(self.long_term),
            'total_facts': len(self.facts),
            'current_tokens': self.current_tokens,
            'token_utilization': self.current_tokens / self.max_tokens,
            'compression_ratio': self._calculate_compression_ratio()
        }
    
    def _calculate_compression_ratio(self) -> float:
        '''Calculate how much we've compressed'''
        if not self.long_term:
            return 1.0
        
        # Original tokens vs compressed tokens; a segment without a recorded
        # original size falls back to its compressed size
        original_tokens = sum(
            segment.get('original_tokens', segment['tokens'])
            for segment in self.long_term
        )
        compressed_tokens = sum(lt['tokens'] for lt in self.long_term)
        
        return compressed_tokens / original_tokens if original_tokens > 0 else 1.0
    
    def count_tokens(self, text: str) -> int:
        '''Count tokens in text'''
        import tiktoken
        enc = tiktoken.encoding_for_model('gpt-4')
        return len(enc.encode(text))


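# Alternative: Vector Memory (for very long conversations)
# Note: ChromaDB() and embed_model are placeholders for a vector-store wrapper and an
# embedding model. The add/query/get_latest calls below are that wrapper's assumed
# interface, not the chromadb client API.
class VectorMemory: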
    '''Store conversation in vector database for semantic retrieval'''
    
    def __init__(self):
        self.vector_store = ChromaDB()
        self.message_count = 0
    
    def add_message(self, role: str, content: str, turn: int) -> None:
        '''Store message as vector'''
        # Embed message
        embedding = embed_model.encode([content])[0]
        
        # Store in vector DB
        self.vector_store.add(
            id=f"turn_{turn}_{role}",  # user and assistant messages can share a turn number
            embedding=embedding,
            metadata={'role': role, 'turn': turn, 'content': content}
        )
        
        self.message_count += 1
    
    def get_relevant_context(self, current_message: str, k: int = 5) -> List[dict]:
        '''Retrieve relevant past messages semantically'''
        # Embed current message
        query_embedding = embed_model.encode([current_message])[0]
        
        # Retrieve similar past messages
        results = self.vector_store.query(
            query_embedding=query_embedding,
            n_results=k
        )
        
        return results
    
    def get_context_for_llm(self, current_message: str) -> str:
        '''Build context from relevant history'''
        # Always include last 3 messages
        recent = self.vector_store.get_latest(n=3)
        
        # Add semantically relevant older messages
        relevant = self.get_relevant_context(current_message, k=3)
        
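        # Combine, dropping messages that appear in both lists
        unique = {(m['turn'], m['role']): m for m in recent + relevant}
        context_messages = list(unique.values())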
        
        # Format
        return '\\n'.join([
            f"[Turn {m['turn']}] {m['role']}: {m['content']}"
            for m in sorted(context_messages, key=lambda x: x['turn'])
        ])
```

**Comparison of Approaches:**

| Approach | Max Turns | Token Overhead | Retrieval Speed | Complexity |
|----------|-----------|----------------|-----------------|------------|
| Naive (drop old) | 20 | 0 | N/A | Low |
| Full summarization | 30-40 | 200-300 | Fast | Medium |
| Hierarchical | 50-100 | 400-500 | Fast | High |
| Vector memory | Unlimited | 100 | 50-100ms | High |

**Production Deployment:**

```python
from dataclasses import asdict

class ProductionConversationSystem:
    def __init__(self, user_id: str):
        self.memory = HierarchicalMemory(max_tokens=2000)
        self.user_id = user_id
        self.turn_count = 0
    
    def chat(self, user_message: str) -> str:
        self.turn_count += 1
        
        # Add user message to memory
        self.memory.add_message('user', user_message, self.turn_count)
        
        # Get optimized context (it already ends with the user message just added)
        context = self.memory.get_context_for_llm()
        
        # Build prompt; the user message is not repeated here
        prompt = f'''
        {context}
        
        assistant:'''
        
        # Generate response
        response = llm.generate(prompt)
        
        # Add response to memory
        self.memory.add_message('assistant', response, self.turn_count)
        
        return response
    
    def persist_memory(self):
        '''Save memory to database for user'''
        memory_state = {
            'user_id': self.user_id,
            'turn_count': self.turn_count,
            'short_term': [asdict(m) for m in self.memory.short_term],
            'long_term': self.memory.long_term,
            'facts': self.memory.facts
        }
        
        db.save_conversation_state(self.user_id, memory_state)
    
    def restore_memory(self):
        '''Restore memory from database'''
        state = db.load_conversation_state(self.user_id)
        if state:
            self.turn_count = state['turn_count']
            # Restore memory structures
            # ...
```

**Key Benefits:**
- Handles 50-100 turn conversations (vs 20 before)
- Maintains important context from early turns via facts + summaries
- Token overhead: 400-500 tokens (vs 4000+ for full history)
- Compression ratio: 5-10x (500 tokens stores equivalent of 5000)
- Cost savings: 60-70% on longer conversations

**Metrics to Track:**
- Average turns before context loss
- User satisfaction with memory quality
- Token usage per conversation
- Fact extraction accuracy
        """,
    },
    {
        "level": "Staff",
        "question": "Design a production agent system that can use 20+ tools reliably. Include tool selection optimization, error recovery, execution monitoring, and explain how to prevent tool hallucination (calling non-existent tools or wrong parameters).",
        "answer": """
**Production Multi-Tool Agent Architecture:**

**1. Tool Registry & Validation:**

```python
from collections import defaultdict
from pydantic import BaseModel
from typing import List, Dict, Any, Optional, Callable
import inspect

class ToolParameter(BaseModel):
    name: str
    type: str  # "string", "number", "boolean", "array", "object"
    description: str
    required: bool = True
    enum: Optional[List[Any]] = None
    
class ToolDefinition(BaseModel):
    name: str
    description: str
    parameters: List[ToolParameter]
    returns: str
    examples: List[dict]
    cost_estimate: float  # Relative cost (1.0 = baseline)
    avg_latency_ms: float
    error_rate: float
    
class ToolRegistry:
    '''Central registry for all available tools'''
    
    def __init__(self):
        self.tools = {}
        self.tool_usage_stats = defaultdict(lambda: {
            'calls': 0,
            'successes': 0,
            'failures': 0,
            'avg_latency': []
        })
    
    def register_tool(self, func: Callable, definition: ToolDefinition):
        '''Register a tool with validation'''
        # Validate function signature matches definition
        sig = inspect.signature(func)
        param_names = set(sig.parameters.keys())
        def_param_names = set(p.name for p in definition.parameters)
        
        if param_names != def_param_names:
            raise ValueError(f"Function signature doesn't match definition for {definition.name}")
        
        self.tools[definition.name] = {
            'function': func,
            'definition': definition,
            'enabled': True
        }
    
    def get_tool_definitions_for_llm(self, limit: Optional[int] = None) -> List[dict]:
        '''Get tool definitions in format for LLM'''
        # Skip disabled tools first so that limit counts only usable tools
        tools_list = [t for t in self.tools.values() if t['enabled']]
        
        if limit:
            # Prioritize by success count, then by lower declared error rate
            tools_list.sort(key=lambda t: (
                self.tool_usage_stats[t['definition'].name]['successes'],
                -t['definition'].error_rate
            ), reverse=True)
            tools_list = tools_list[:limit]
        
        return [
            {
                'name': tool['definition'].name,
                'description': tool['definition'].description,
                'parameters': {
                    'type': 'object',
                    'properties': {
                        p.name: {
                            'type': p.type,
                            'description': p.description,
                            # Omit 'enum' when unset; a null enum is not valid JSON Schema
                            **({'enum': p.enum} if p.enum else {})
                        }
                        for p in tool['definition'].parameters
                    },
                    'required': [p.name for p in tool['definition'].parameters if p.required]
                },
                'examples': tool['definition'].examples[:2]  # Include examples!
            }
            for tool in tools_list
            if tool['enabled']
        ]
    
    def validate_tool_call(self, tool_name: str, args: dict) -> tuple[bool, Optional[str]]:
        '''Validate tool call before execution'''
        if tool_name not in self.tools:
            return False, f"Tool '{tool_name}' not found. Available: {list(self.tools.keys())}"
        
        tool_def = self.tools[tool_name]['definition']
        
        # Check required parameters
        required_params = [p.name for p in tool_def.parameters if p.required]
        missing = set(required_params) - set(args.keys())
        if missing:
            return False, f"Missing required parameters: {missing}"
        
        # Validate parameter types
        for param in tool_def.parameters:
            if param.name in args:
                value = args[param.name]
                expected_type = param.type
                
                # Type validation (bool is a subclass of int, so reject it explicitly for numbers)
                if expected_type == 'string' and not isinstance(value, str):
                    return False, f"Parameter '{param.name}' must be a string"
                elif expected_type == 'number' and (isinstance(value, bool) or not isinstance(value, (int, float))):
                    return False, f"Parameter '{param.name}' must be a number"
                # ... more type checks
                
                # Enum validation
                if param.enum and value not in param.enum:
                    return False, f"Parameter '{param.name}' must be one of {param.enum}"
        
        return True, None


# Example tool definitions
search_tool = ToolDefinition(
    name="web_search",
    description="Search the internet for information. Use for current events, facts, and recent data.",
    parameters=[
        ToolParameter(name="query", type="string", description="Search query", required=True),
        ToolParameter(name="num_results", type="number", description="Number of results (1-10)", required=False)
    ],
    returns="List of search results with title, snippet, and URL",
    examples=[
        {"query": "latest news on AI", "num_results": 5},
        {"query": "weather in San Francisco"}
    ],
    cost_estimate=1.0,
    avg_latency_ms=200,
    error_rate=0.02
)

calc_tool = ToolDefinition(
    name="calculator",
    description="Perform mathematical calculations. Supports +, -, *, /, **, sqrt, etc.",
    parameters=[
        ToolParameter(name="expression", type="string", description="Mathematical expression to evaluate", required=True)
    ],
    returns="Numerical result of the calculation",
    examples=[
        {"expression": "2 + 2"},
        {"expression": "sqrt(144) * 3"}
    ],
    cost_estimate=0.1,
    avg_latency_ms=5,
    error_rate=0.01
)

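# web_search_function and calculator_function are assumed to be defined elsewhere;
# their parameter names must match the definitions above or register_tool raises ValueError
registry = ToolRegistry()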
registry.register_tool(web_search_function, search_tool)
registry.register_tool(calculator_function, calc_tool)
```
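
The `# ... more type checks` placeholder in `validate_tool_call` can be filled in several ways. One option is a lookup table from JSON Schema type names to Python types. The following is a minimal sketch rather than the registry's actual implementation; `JSON_TYPE_MAP` and `check_type` are hypothetical names introduced here for illustration.

```python
# Hypothetical helper: maps JSON Schema type names to Python types.
JSON_TYPE_MAP = {
    'string': str,
    'number': (int, float),
    'integer': int,
    'boolean': bool,
    'array': list,
    'object': dict,
}

def check_type(value, expected_type: str) -> bool:
    '''Return True if value matches the JSON Schema type name.'''
    py_type = JSON_TYPE_MAP.get(expected_type)
    if py_type is None:
        return True  # Unknown type name: skip the check rather than reject
    # bool is a subclass of int, so only accept it for 'boolean'
    if isinstance(value, bool) and expected_type != 'boolean':
        return False
    return isinstance(value, py_type)

print(check_type('hello', 'string'))
print(check_type(True, 'number'))
print(check_type(3, 'integer'))
```

A table keeps the validation loop to a single call per parameter and makes adding a new type a one-line change.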

**2. Intelligent Tool Selection (Prevent Hallucination):**

```python
class ToolSelectionOptimizer:
    '''Optimize tool selection to prevent hallucination'''
    
    def __init__(self, registry: ToolRegistry):
        self.registry = registry
        self.selection_history = []
    
    def select_tools_for_query(self, query: str, max_tools: int = 5) -> List[dict]:
        '''Select most relevant tools for query'''
        # Strategy 1: Keyword matching
        query_lower = query.lower()
        keyword_scores = {}
        
        for tool_name, tool in self.registry.tools.items():
            score = 0
            desc = tool['definition'].description.lower()
            
            # Simple keyword overlap
            query_words = set(query_lower.split())
            desc_words = set(desc.split())
            overlap = len(query_words & desc_words)
            score += overlap * 0.5
            
            # Boost frequently successful tools
            stats = self.registry.tool_usage_stats[tool_name]
            if stats['calls'] > 0:
                success_rate = stats['successes'] / stats['calls']
                score += success_rate * 0.3
            
            keyword_scores[tool_name] = score
        
        # Select top-k tools, dropping any with zero score
        top_tools = sorted(keyword_scores.items(), key=lambda x: x[1], reverse=True)[:max_tools]
        tool_names = [name for name, score in top_tools if score > 0]
        
        # Return LLM-format definitions for the selected tools only, in rank order
        defs_by_name = {
            d['name']: d
            for d in self.registry.get_tool_definitions_for_llm()
        }
        return [defs_by_name[name] for name in tool_names if name in defs_by_name]
    
    def build_tool_prompt(self, query: str, relevant_tools: List[dict]) -> str:
        '''Build prompt with only relevant tools'''
        prompt = f'''
You have access to these tools:

{json.dumps(relevant_tools, indent=2)}

CRITICAL INSTRUCTIONS:
1. You can ONLY use tools listed above
2. You MUST use exact tool names (case-sensitive)
3. You MUST provide all required parameters
4. Check the examples for correct usage
5. If you need a tool that's not listed, respond: "I don't have access to that tool"

Query: {query}

Respond in JSON:
{{
    "thought": "What I need to do...",
    "tool_calls": [
        {{"tool": "exact_tool_name", "args": {{...}} }}
    ]
}}
'''
        return prompt


# Example: Dynamic tool selection
optimizer = ToolSelectionOptimizer(registry)

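queries = [
    "What's 15% of 250?",  # Intended: calculator
    "Latest news on SpaceX",  # Intended: web_search
    "Search for Python tutorials and calculate 2+2"  # Intended: both
]
# Note: plain keyword overlap only matches words shared verbatim with a tool
# description, so it can miss the intended tool for queries like the first two.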

for query in queries:
    tools = optimizer.select_tools_for_query(query, max_tools=3)
    print(f"Query: {query}")
    print(f"Selected tools: {[t['name'] for t in tools]}")
    print()
```
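
Whitespace splitting leaves punctuation attached to words (`calculations.` never equals `calculations`), which weakens the overlap score. A minimal sketch of a slightly more forgiving tokenizer follows. The stopword list and the names `tokenize` and `keyword_overlap` are assumptions for illustration, and this is still lexical matching, not semantic matching.

```python
import re

# Hypothetical stopword list; a real system would use a fuller one
STOPWORDS = {'a', 'an', 'and', 'for', 'in', 'of', 'on', 'the', 'to', 'use'}

def tokenize(text: str) -> set:
    '''Lowercase, split on non-alphanumerics, drop stopwords.'''
    return {w for w in re.findall(r'[a-z0-9]+', text.lower()) if w not in STOPWORDS}

def keyword_overlap(query: str, description: str) -> set:
    '''Words shared by the query and a tool description.'''
    return tokenize(query) & tokenize(description)

print(keyword_overlap('Search for Python tutorials',
                      'Search the internet for information.'))
print(keyword_overlap('Run some calculations',
                      'Perform mathematical calculations.'))
```

Queries that share no vocabulary with any description (such as a bare arithmetic question) still score zero, so a fallback such as embedding similarity or a default tool set is worth considering.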

**3. Robust Execution with Error Recovery:**

```python
class RobustToolExecutor:
    '''Execute tools with validation, retry, and error recovery'''
    
    def __init__(self, registry: ToolRegistry, max_retries: int = 2):
        self.registry = registry
        self.max_retries = max_retries
        self.execution_log = []
    
    def execute_plan(self, tool_calls: List[dict]) -> dict:
        '''Execute tool calls in order, collecting results and failures'''
        results = {}
        failed_calls = []
        
        for i, call in enumerate(tool_calls):
            tool_name = call['tool']
            args = call['args']
            
            # Validate before execution
            valid, error = self.registry.validate_tool_call(tool_name, args)
            if not valid:
                failed_calls.append({
                    'tool': tool_name,
                    'args': args,
                    'error': error,
                    'error_type': 'validation'
                })
                continue
            
            # Execute with retry
            result = self.execute_with_retry(tool_name, args, call_id=i)
            
            if result['status'] == 'success':
                results[f"call_{i}"] = result['output']
            else:
                failed_calls.append({
                    'tool': tool_name,
                    'args': args,
                    'error': result['error'],
                    'error_type': 'execution'
                })
        
        return {
            'results': results,
            'failed_calls': failed_calls,
            'success_rate': len(results) / len(tool_calls) if tool_calls else 0
        }
    
    def execute_with_retry(self, tool_name: str, args: dict, call_id: int) -> dict:
        '''Execute single tool with automatic retry'''
        import time
        
        for attempt in range(self.max_retries):
            try:
                # Get tool function
                tool = self.registry.tools[tool_name]['function']
                
                # Execute with timeout
                start = time.time()
                output = self._execute_with_timeout(tool, args, timeout_seconds=30)
                latency = (time.time() - start) * 1000
                
                # Log success
                self._log_execution(tool_name, args, 'success', latency, None)
                
                # Update stats
                self.registry.tool_usage_stats[tool_name]['calls'] += 1
                self.registry.tool_usage_stats[tool_name]['successes'] += 1
                self.registry.tool_usage_stats[tool_name]['avg_latency'].append(latency)
                
                return {
                    'status': 'success',
                    'output': output,
                    'latency_ms': latency,
                    'attempt': attempt + 1
                }
                
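            except TimeoutError:
                error_msg = "Tool execution timed out after 30s"
                if attempt == self.max_retries - 1:
                    self._log_execution(tool_name, args, 'timeout', 0, error_msg)
                    # Count the failed call so ToolMonitor can compute an error rate
                    stats = self.registry.tool_usage_stats[tool_name]
                    stats['calls'] += 1
                    stats['failures'] = stats.get('failures', 0) + 1
                    return {'status': 'error', 'error': error_msg, 'error_type': 'timeout'}
                time.sleep(1)  # Brief delay before retry
                
            except Exception as e:
                error_msg = str(e)
                if attempt == self.max_retries - 1:
                    self._log_execution(tool_name, args, 'error', 0, error_msg)
                    stats = self.registry.tool_usage_stats[tool_name]
                    stats['calls'] += 1
                    stats['failures'] = stats.get('failures', 0) + 1
                    return {'status': 'error', 'error': error_msg, 'error_type': 'exception'}
                time.sleep(0.5)
        
        return {'status': 'error', 'error': 'Max retries exceeded'}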
    
    def _execute_with_timeout(self, func: Callable, args: dict, timeout_seconds: int) -> Any:
        '''Execute function with timeout (SIGALRM: Unix only, main thread only)'''
        import signal
        
        def timeout_handler(signum, frame):
            raise TimeoutError()
        
        # Set timeout
        signal.signal(signal.SIGALRM, timeout_handler)
        signal.alarm(timeout_seconds)
        
        try:
            result = func(**args)
        finally:
            signal.alarm(0)  # Cancel timeout
        
        return result
    
    def _log_execution(self, tool_name: str, args: dict, status: str, latency: float, error: Optional[str]):
        '''Log tool execution for monitoring'''
        self.execution_log.append({
            'timestamp': datetime.utcnow().isoformat(),
            'tool': tool_name,
            'args': args,
            'status': status,
            'latency_ms': latency,
            'error': error
        })
```
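
`signal.alarm` is available only on Unix and only in the main thread, so `_execute_with_timeout` will not work on Windows or inside a worker thread. One portable alternative is to run the tool in a thread pool and wait with a timeout. The sketch below assumes that approach; `run_with_timeout` and `slow_tool` are hypothetical names. Note the trade-off: a timed-out thread cannot be killed and keeps running in the background.

```python
import concurrent.futures
import time

def run_with_timeout(func, args: dict, timeout_seconds: float):
    '''Run func(**args) in a worker thread and give up after timeout_seconds.'''
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, **args)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        # Normalize to the built-in TimeoutError used by the retry logic
        raise TimeoutError(f'Tool execution timed out after {timeout_seconds}s') from None
    finally:
        # Do not block on a still-running worker; the thread itself cannot be killed
        executor.shutdown(wait=False)

def slow_tool(delay: float) -> str:
    time.sleep(delay)
    return 'done'

print(run_with_timeout(slow_tool, {'delay': 0.01}, timeout_seconds=1.0))
```

For tools that must be stopped hard on timeout, running them in a separate process is the usual next step.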

**4. Agent with Tool Hallucination Prevention:**

```python
class ProductionToolAgent:
    '''Production agent with robust tool usage'''
    
    def __init__(self, registry: ToolRegistry):
        self.registry = registry
        self.executor = RobustToolExecutor(registry)
        self.optimizer = ToolSelectionOptimizer(registry)
        self.failures = []
    
    def run(self, query: str) -> dict:
        '''Execute query with tool usage'''
        # Step 1: Select relevant tools
        relevant_tools = self.optimizer.select_tools_for_query(query, max_tools=5)
        
        # Step 2: Generate plan
        plan_prompt = self.optimizer.build_tool_prompt(query, relevant_tools)
        plan_response = llm.generate(plan_prompt, temperature=0.2)
        
        # Step 3: Parse and validate plan
        try:
            plan = json.loads(plan_response)
            tool_calls = plan.get('tool_calls', [])
        except json.JSONDecodeError:
            return {
                'status': 'error',
                'error': 'Failed to parse LLM response as JSON',
                'raw_response': plan_response
            }
        
        # Step 4: Validate tool calls
        invalid_calls = []
        valid_calls = []
        
        for call in tool_calls:
            tool_name = call.get('tool')
            
            # Check if tool exists
            if tool_name not in self.registry.tools:
                invalid_calls.append({
                    'tool': tool_name,
                    'error': f"Tool '{tool_name}' does not exist",
                    'available_tools': list(self.registry.tools.keys())
                })
                continue
            
            # Validate parameters
            args = call.get('args', {})
            valid, error = self.registry.validate_tool_call(tool_name, args)
            
            if valid:
                valid_calls.append(call)
            else:
                invalid_calls.append({
                    'tool': tool_name,
                    'args': args,
                    'error': error
                })
        
        # Step 5: Handle invalid calls with retry
        if invalid_calls and not valid_calls:
            # All calls invalid - ask LLM to fix
            fix_prompt = f'''
Your previous tool calls were invalid:
{json.dumps(invalid_calls, indent=2)}

Available tools:
{json.dumps(relevant_tools, indent=2)}

Please provide corrected tool calls for the query: "{query}"
'''
            corrected_response = llm.generate(fix_prompt, temperature=0.1)
            # Parse and validate again (with recursion limit)
            # ...
        
        # Step 6: Execute valid calls
        execution_result = self.executor.execute_plan(valid_calls)
        
        # Step 7: Synthesize final answer
        answer = self.synthesize_answer(query, execution_result, plan.get('thought'))
        
        return {
            # 'success' only if every requested call was valid and executed cleanly
            'status': 'success' if not (invalid_calls or execution_result['failed_calls']) else 'partial',
            'answer': answer,
            'tool_calls': len(valid_calls),
            'invalid_calls': len(invalid_calls),
            'failed_executions': len(execution_result['failed_calls']),
            'execution_details': execution_result
        }
    
    def synthesize_answer(self, query: str, execution_result: dict, thought: str) -> str:
        '''Synthesize final answer from tool results'''
        prompt = f'''
Query: {query}

Plan: {thought}

Tool Results:
{json.dumps(execution_result['results'], indent=2)}

Synthesize a clear answer to the query based on these results.
'''
        
        return llm.generate(prompt, temperature=0.5)
```
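
Step 5 leaves the re-parse and re-validate loop elided. One way to bound it is an iterative loop with a fixed number of repair rounds instead of recursion. This is a sketch under assumptions, not the agent's actual code: `plan_with_repair` is a hypothetical helper, `generate` stands in for `llm.generate`, and `validate` has the same shape as `registry.validate_tool_call`. Unlike Step 5, it re-prompts whenever any call is invalid, not only when all of them are.

```python
import json
from typing import Callable, List, Optional, Tuple

def plan_with_repair(
    generate: Callable[[str], str],
    validate: Callable[[str, dict], Tuple[bool, Optional[str]]],
    prompt: str,
    max_rounds: int = 2,
) -> Tuple[List[dict], List[dict]]:
    '''Ask for tool calls, re-prompting with the errors up to max_rounds times.

    Returns (valid_calls, remaining_errors).
    '''
    valid_calls: List[dict] = []
    errors: List[dict] = []
    for _ in range(max_rounds + 1):
        try:
            calls = json.loads(generate(prompt)).get('tool_calls', [])
        except (json.JSONDecodeError, AttributeError):
            valid_calls, errors = [], [{'error': 'response was not a JSON object'}]
        else:
            valid_calls, errors = [], []
            for call in calls:
                ok, err = validate(call.get('tool'), call.get('args', {}))
                if ok:
                    valid_calls.append(call)
                else:
                    errors.append({'tool': call.get('tool'), 'error': err})
        if not errors:
            break
        # Feed the validation errors back so the model can correct itself
        prompt = f'{prompt} Previous errors: {json.dumps(errors)}. Return corrected JSON.'
    return valid_calls, errors

# Usage with stand-in callables (hypothetical)
responses = iter([
    '{"tool_calls": [{"tool": "calc", "args": {}}]}',
    '{"tool_calls": [{"tool": "calculator", "args": {"expression": "2 + 2"}}]}',
])
fake_generate = lambda _prompt: next(responses)
fake_validate = lambda tool, args: (tool == 'calculator', None if tool == 'calculator' else 'unknown tool')

valid, errors = plan_with_repair(fake_generate, fake_validate, 'What is 2 + 2?')
print(valid, errors)
```

Capping the rounds keeps the worst-case cost at `max_rounds + 1` LLM calls per plan.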

**5. Monitoring & Alert System:**

```python
class ToolMonitor:
    '''Monitor tool usage and detect issues'''
    
    def __init__(self, registry: ToolRegistry):
        self.registry = registry
        self.alerts = []
    
    def check_health(self) -> List[dict]:
        '''Check tool system health'''
        alerts = []
        
        for tool_name, stats in self.registry.tool_usage_stats.items():
            if stats['calls'] < 10:
                continue  # Not enough data
            
            # Check error rate
            error_rate = stats['failures'] / stats['calls']
            if error_rate > 0.15:  # 15% error rate threshold
                alerts.append({
                    'type': 'high_error_rate',
                    'tool': tool_name,
                    'error_rate': error_rate,
                    'severity': 'high'
                })
            
            # Check latency regression
            if len(stats['avg_latency']) >= 100:
                recent_latency = np.mean(stats['avg_latency'][-100:])
                baseline_latency = self.registry.tools[tool_name]['definition'].avg_latency_ms
                
                if recent_latency > baseline_latency * 2:  # 2x slower
                    alerts.append({
                        'type': 'latency_regression',
                        'tool': tool_name,
                        'current_latency': recent_latency,
                        'baseline': baseline_latency,
                        'severity': 'medium'
                    })
        
        return alerts
    
    def get_tool_usage_report(self) -> dict:
        '''Generate usage report for analysis'''
        report = {}
        
        for tool_name, stats in self.registry.tool_usage_stats.items():
            if stats['calls'] > 0:
                report[tool_name] = {
                    'calls': stats['calls'],
                    'success_rate': stats['successes'] / stats['calls'],
                    'avg_latency_ms': np.mean(stats['avg_latency']) if stats['avg_latency'] else 0,
                    'p95_latency_ms': np.percentile(stats['avg_latency'], 95) if len(stats['avg_latency']) > 10 else 0
                }
        
        return report
```
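
The executor appends one entry to `stats['avg_latency']` per successful call and nothing trims it, so the list grows without bound in a long-running service. A fixed-size window is one way to cap it. The sketch below assumes a standalone `RollingLatency` class (a hypothetical name) and uses the nearest-rank percentile, which can differ slightly from the linear interpolation that `np.percentile` applies by default.

```python
import math
from collections import deque

class RollingLatency:
    '''Fixed-size window of recent latencies (ms); old samples fall off automatically.'''

    def __init__(self, window: int = 100):
        self.samples = deque(maxlen=window)

    def add(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def mean(self) -> float:
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

    def p95(self) -> float:
        '''Nearest-rank 95th percentile of the window.'''
        if not self.samples:
            return 0.0
        ordered = sorted(self.samples)
        rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
        return ordered[rank - 1]

tracker = RollingLatency(window=5)
for ms in [10, 20, 30, 40, 50, 60]:
    tracker.add(ms)  # the first sample (10) is evicted by the sixth
print(tracker.mean(), tracker.p95())
```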

**Key Benefits:**
- Prevents tool hallucination with validation at multiple levels
- Handles 20+ tools by dynamically selecting relevant subset (5-7 per query)
- Automatic retry and error recovery
- Real-time monitoring and alerting
- Cost optimization through tool selection

**Production Metrics:**
- Tool hallucination rate: <1% (vs 10-15% without validation)
- Tool success rate: >95% (with retries)
- Average tool calls per query: 2.5 with selection (down from 5-10 without it)
- Cost per query: 40% reduction through smart selection
        """,
    },
]

for i, qa in enumerate(interview_questions_langchain, 1):
    print(f"\n{'=' * 100}")
    print(f"LANGCHAIN & AGENTS - Q{i} [{qa['level']} Level]")
    print('=' * 100)
    print(f"\n{qa['question']}\n")
    print("ANSWER:")
    print(qa['answer'])
    print()


## Module 3 Summary: Production LangChain & Agents

### Key Takeaways for Experienced Engineers

Production chain and agent systems require careful cost management, memory optimization, and robust tool orchestration.


In [None]:
print("MODULE 3: LANGCHAIN & AGENTS - KEY TAKEAWAYS")
print("=" * 100)

summary = {
    "Cost Optimization": [
        "Tool caching: 20-25% cost reduction with minimal implementation (1-2 days)",
        "Model routing: 30% reduction by routing simple tasks to GPT-3.5 (15x cheaper)",
        "Plan-and-Execute: 40% reduction vs ReAct (2 LLM calls vs 5-10)",
        "Tool fusion: Batch multiple calls into single API request (50%+ savings)",
        "Monitor tool usage: Track redundant calls (15% of calls can be eliminated)",
    ],
    "Memory Management": [
        "Hierarchical memory: Short-term (recent) + Long-term (summarized) + Facts",
        "Compression ratio: 5-10x (500 tokens represents 5000 tokens of history)",
        "Handles 50-100 turns vs 20 with naive approach",
        "Fact extraction: Preserve important context from early turns",
        "Vector memory: Alternative for unlimited history with semantic retrieval",
    ],
    "Tool Orchestration": [
        "Tool registry: Central validation and monitoring for all tools",
        "Dynamic selection: Show LLM only 5-7 relevant tools per query (not all 20+)",
        "Validation layers: Registry → Parameter types → Execution → Retry",
        "Examples in prompts: Reduces tool hallucination from 10-15% to <1%",
        "Automatic retry: 2-3 attempts with timeout handling (95%+ success rate)",
    ],
    "Chain Design": [
        "Robust chains: Automatic retry with feedback to LLM on parse failures",
        "Metrics tracking: Success rate, latency, token usage per chain execution",
        "Error recovery: Graceful degradation rather than hard failures",
        "Output validation: Pydantic schemas with structured parsing",
    ],
    "Production Patterns": [
        "ReAct: Good for exploration, expensive (5-10 LLM calls)",
        "Plan-and-Execute: Better for production, efficient (2-3 LLM calls)",
        "Tool caching: Essential for repeated queries (30% cache hit rate typical)",
        "Monitoring: Tool success rate, latency regression, cost per query",
    ],
}

for section, points in summary.items():
    print(f"\n{section}:")
    for point in points:
        print(f"  - {point}")

print("\n" + "=" * 100)
print("\nINTERVIEW QUESTIONS SUMMARY:")
print("  - Cost Optimization: Reduce agent cost from $0.30 to $0.15 per query")
print("  - Memory Management: Hierarchical memory for 50-100 turn conversations")
print("  - Tool Orchestration: Reliable 20+ tool system with hallucination prevention")
print("  Total: 3 advanced questions (2 Senior, 1 Staff level)")

print("\n" + "=" * 100)
print("\nNEXT STEPS:")
print("  1. Implement tool caching + model routing for immediate 50% cost reduction")
print("  2. Deploy hierarchical memory for long conversations")
print("  3. Build tool registry with validation and monitoring")
print("  4. A/B test Plan-and-Execute pattern vs ReAct")
print("  5. Continue to Module 4: LangGraph (stateful workflows, HITL)")

print("\n" + "=" * 100)
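
The summary quotes per-technique savings (for example 20-25% for caching and 30% for routing) and a combined figure. How such percentages combine depends on whether they apply to the same spend. The worked example below is a sketch under that stated assumption: if each technique reduces whatever cost the previous one left, the savings compound rather than add. `combined_reduction` is a hypothetical helper.

```python
def combined_reduction(*reductions: float) -> float:
    '''Combined fractional saving when each reduction applies to the cost left by the previous one.'''
    remaining = 1.0
    for r in reductions:
        remaining *= (1.0 - r)
    return 1.0 - remaining

caching, routing = 0.20, 0.30
print(f'Compounded: {combined_reduction(caching, routing):.0%}')
print(f'Additive:   {caching + routing:.0%}')
```

The additive figure is only right when the two techniques cut disjoint parts of the bill and each percentage is measured against total cost, so it is worth measuring the combined effect directly.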


### Section 2: Agent Architectures

LangChain supports multiple agent patterns:
- **ReAct**: Reasoning + Acting in interleaved steps
- **Plan-Execute**: Plan first, then execute
- **Self-Ask**: Breaks down complex questions
- **Structured Chat**: Uses structured format for tool calls
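
To make the contrast with ReAct concrete, here is a minimal Plan-Execute sketch: one LLM call produces the whole plan, the tools run without further LLM involvement, and a second call writes the answer. Everything in it is an assumption for illustration (the `PLAN:`/`ANSWER:` prompt prefixes, the JSON plan format, `plan_and_execute`, and the mock LLM); it is not LangChain's own implementation.

```python
import json
from typing import Callable, Dict

def plan_and_execute(task: str, llm: Callable[[str], str],
                     tools: Dict[str, Callable[[str], str]]) -> dict:
    '''One planning call, tool execution without the LLM, one synthesis call.'''
    # LLM call 1: ask for the whole plan up front as a JSON list of steps
    plan = json.loads(llm(f'PLAN: {task}'))
    observations = []
    for step in plan:
        tool = tools.get(step['tool'])
        result = tool(step['input']) if tool else f"Error: unknown tool {step['tool']}"
        observations.append(result)
    # LLM call 2: synthesize the final answer from all observations
    answer = llm(f'ANSWER: {task} | observations: {observations}')
    return {'answer': answer, 'plan': plan, 'llm_calls': 2}

# Mock LLM, purely for illustration
def mock_planner_llm(prompt: str) -> str:
    if prompt.startswith('PLAN:'):
        return json.dumps([{'tool': 'search', 'input': 'capital of Japan'}])
    return 'The capital of Japan is Tokyo'

plan_result = plan_and_execute(
    'What is the capital of Japan?',
    mock_planner_llm,
    {'search': lambda q: 'Tokyo'},
)
print(plan_result['answer'], plan_result['llm_calls'])
```

The fixed plan is what keeps the call count low; the cost is that the agent cannot react to a surprising observation mid-run the way a ReAct loop can.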

In [None]:
from typing import List, Dict, Any, Optional, Callable
import json

class Tool:
    '''Tool that agent can use'''
    def __init__(self, name: str, func: Callable, description: str):
        self.name = name
        self.func = func
        self.description = description
    
    def run(self, *args, **kwargs):
        return self.func(*args, **kwargs)

class ReActAgent:
    '''ReAct pattern: Reason + Act in loop'''
    
    def __init__(self, llm_func: Callable, tools: List[Tool], max_iterations=5):
        self.llm = llm_func
        self.tools = {t.name: t for t in tools}
        self.max_iterations = max_iterations
    
    def run(self, task: str) -> dict:
        '''Execute ReAct loop'''
        conversation = []
        
        for i in range(self.max_iterations):
            # Reasoning step
            prompt = self._build_prompt(task, conversation)
            response = self.llm(prompt)
            
            thought, action, action_input = self._parse_response(response)
            
            conversation.append({
                'step': i + 1,
                'thought': thought,
                'action': action,
                'action_input': action_input
            })
            
            # Check if done
            if action == 'Final Answer':
                return {
                    'answer': action_input,
                    'steps': conversation,
                    'iterations': i + 1
                }
            
            # Acting step
            if action in self.tools:
                observation = self.tools[action].run(action_input)
                conversation.append({
                    'observation': observation
                })
            else:
                conversation.append({
                    'observation': f'Error: Unknown action {action}'
                })
        
        return {
            'answer': 'Max iterations reached',
            'steps': conversation,
            'iterations': self.max_iterations
        }
    
    def _build_prompt(self, task: str, conversation: List[dict]) -> str:
        '''Build ReAct prompt'''
        tool_descriptions = '\n'.join([
            f'- {name}: {tool.description}'
            for name, tool in self.tools.items()
        ])
        
        prompt = f'''Answer the following question using this format:

Thought: [Your reasoning about what to do next]
Action: [Tool name or "Final Answer"]
Action Input: [Input to the tool or final answer]

Available tools:
{tool_descriptions}

Question: {task}

'''
        
        # Add conversation history
        for step in conversation:
            if 'thought' in step:
                prompt += f"Thought: {step['thought']}\n"
                prompt += f"Action: {step['action']}\n"
                prompt += f"Action Input: {step['action_input']}\n"
            if 'observation' in step:
                prompt += f"Observation: {step['observation']}\n\n"
        
        return prompt
    
    def _parse_response(self, response: str) -> tuple:
        '''Parse LLM response into thought, action, input'''
        # Simplified parser
        lines = response.strip().split('\n')
        
        thought = ''
        action = ''
        action_input = ''
        
        for line in lines:
            if line.startswith('Thought:'):
                thought = line.replace('Thought:', '').strip()
            elif line.startswith('Action:'):
                action = line.replace('Action:', '').strip()
            elif line.startswith('Action Input:'):
                action_input = line.replace('Action Input:', '').strip()
        
        return thought, action, action_input

# Example tools
def calculator(expression: str) -> str:
    '''Calculator tool'''
    try:
        result = eval(expression)  # UNSAFE in prod - use safe_eval
        return f"Result: {result}"
    except Exception as e:
        return f"Error: {e}"

def search(query: str) -> str:
    '''Mock search tool'''
    # In prod, call actual search API
    results = {
        'population of France': '67 million',
        'capital of Japan': 'Tokyo',
        'Python release year': '1991',
    }
    for key, value in results.items():
        if key.lower() in query.lower():
            return value
    return 'No results found'

# Mock LLM
def mock_llm(prompt: str) -> str:
    '''Mock LLM that follows ReAct format'''
    if 'population of france' in prompt.lower():  # compare in lowercase on both sides
        if 'Observation:' not in prompt:
            return '''Thought: I need to search for population data
Action: search
Action Input: population of France'''
        else:
            return '''Thought: I have the answer
Action: Final Answer
Action Input: The population of France is 67 million'''
    return '''Thought: Task complete
Action: Final Answer
Action Input: Done'''

# Test ReAct agent
tools = [
    Tool('calculator', calculator, 'Performs mathematical calculations'),
    Tool('search', search, 'Searches for information online'),
]

agent = ReActAgent(mock_llm, tools, max_iterations=5)

print('REACT AGENT DEMONSTRATION')
print('=' * 90)

task = 'What is the population of France?'
result = agent.run(task)

print(f'\nTask: {task}\n')
print(f'Answer: {result["answer"]}')
print(f'Iterations: {result["iterations"]}\n')

print('Steps taken:')
for step in result['steps']:
    if 'thought' in step:
        print(f"\n  Step {step['step']}:")
        print(f"    Thought: {step['thought']}")
        print(f"    Action: {step['action']}")
        print(f"    Input: {step['action_input']}")
    if 'observation' in step:
        print(f"    Observation: {step['observation']}")

print('\n' + '=' * 90)
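
The `calculator` tool above calls `eval`, which the inline comment flags as unsafe. One common replacement walks the parsed expression tree and allows only arithmetic nodes. The version below is a minimal sketch of that idea, not a vetted library: the name `safe_eval` is borrowed from the comment, the operator whitelist is an assumption, and `**` is left out on purpose because very large exponents can hang the process.

```python
import ast
import operator

# Whitelisted operators; anything else is rejected
_BIN_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}

def safe_eval(expression: str) -> float:
    '''Evaluate a purely arithmetic expression without calling eval().'''
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attribute access, etc. all end up here
        raise ValueError(f'Unsupported expression element: {type(node).__name__}')
    return _eval(ast.parse(expression, mode='eval'))

print(safe_eval('2 + 3 * 4'))
```

Function calls such as `sqrt(144)` are rejected by this sketch; supporting them would need an explicit whitelist of function names.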

### Section 3: Memory Management

Different memory types for different needs:
- **ConversationBufferMemory**: Store all messages
- **ConversationSummaryMemory**: Summarize old messages
- **ConversationTokenBufferMemory**: Limit by token count
- **VectorStoreMemory**: Semantic search over history

In [None]:
from collections import deque
import tiktoken

class MemoryManager:
    '''Production memory management for conversational systems'''
    
    def __init__(self, strategy='sliding_window', max_tokens=2000):
        self.strategy = strategy
        self.max_tokens = max_tokens
        self.messages = []
        self.encoding = tiktoken.encoding_for_model('gpt-4')
    
    def add_message(self, role: str, content: str):
        '''Add message to memory'''
        self.messages.append({'role': role, 'content': content})
        
        # Apply memory strategy
        if self.strategy == 'sliding_window':
            self._apply_sliding_window()
        elif self.strategy == 'summary':
            self._apply_summary()
        elif self.strategy == 'semantic':
            self._apply_semantic()
    
    def _count_tokens(self, messages: List[dict]) -> int:
        '''Count tokens in message list'''
        total = 0
        for msg in messages:
            total += len(self.encoding.encode(msg['content']))
        return total
    
    def _apply_sliding_window(self):
        '''Keep only recent messages within token budget'''
        # Assumes messages[0] is the system message: keep it and the newest message,
        # dropping the oldest non-system message until the history fits the budget
        while self._count_tokens(self.messages) > self.max_tokens and len(self.messages) > 2:
            self.messages.pop(1)
    
    def _apply_summary(self):
        '''Summarize old messages when over budget'''
        if self._count_tokens(self.messages) > self.max_tokens and len(self.messages) > 5:
            # Keep the original system message (assumed to be first) and the last
            # 4 messages; summarize everything in between
            # (Simplified - in prod, call LLM to generate summary)
            system_msg = self.messages[0]
            old_messages = self.messages[1:-4]
            summary = self._generate_summary(old_messages)
            
            self.messages = [
                system_msg,
                {'role': 'system', 'content': f'Previous conversation summary: {summary}'},
                *self.messages[-4:]
            ]
    
    def _generate_summary(self, messages: List[dict]) -> str:
        '''Generate summary of messages'''
        # Mock summary - in prod, use LLM
        return f'Discussed {len(messages)} topics'
    
    def _apply_semantic(self):
        '''Keep semantically relevant messages'''
        # In production, use vector similarity to keep relevant messages
        pass
    
    def get_messages(self) -> List[dict]:
        '''Get messages for LLM context'''
        return self.messages
    
    def get_summary(self) -> dict:
        '''Get memory stats'''
        return {
            'total_messages': len(self.messages),
            'total_tokens': self._count_tokens(self.messages),
            'strategy': self.strategy,
            'max_tokens': self.max_tokens,
        }

# Test memory strategies
print('MEMORY MANAGEMENT DEMONSTRATION')
print('=' * 90)

for strategy in ['sliding_window', 'summary']:
    print(f'\nStrategy: {strategy}')
    print('-' * 90)
    
    # Small budget so that trimming actually triggers on this short demo conversation
    memory = MemoryManager(strategy=strategy, max_tokens=40)
    
    # Simulate conversation
    memory.add_message('system', 'You are a helpful assistant.')
    memory.add_message('user', 'What is machine learning?')
    memory.add_message('assistant', 'Machine learning is a subset of AI that enables systems to learn from data.')
    memory.add_message('user', 'Tell me about neural networks.')
    memory.add_message('assistant', 'Neural networks are computing systems inspired by biological neural networks.')
    memory.add_message('user', 'What about deep learning?')
    memory.add_message('assistant', 'Deep learning uses neural networks with multiple layers for complex patterns.')
    memory.add_message('user', 'Explain transformers.')
    
    # Check memory state
    summary = memory.get_summary()
    print(f'  Messages retained: {summary["total_messages"]}')
    print(f'  Tokens used: {summary["total_tokens"]}/{summary["max_tokens"]}')
    print(f'  Latest messages:')
    for msg in memory.get_messages()[-2:]:
        print(f'    {msg["role"]}: {msg["content"][:60]}...')

print('\n' + '=' * 90)
print('KEY INSIGHT: Choose memory strategy based on use case')
print('  - Sliding window: Simple, predictable token usage')
print('  - Summary: Better context retention, more expensive')
print('  - Semantic: Best for long conversations, most complex')
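
`_apply_semantic` is left as a stub above. The idea behind it can be sketched without a vector database: score each stored message against the current query and keep the top matches. The example below is an illustration under stated assumptions, using bag-of-words cosine similarity as a stand-in for real embeddings; `select_relevant` and its helpers are hypothetical names.

```python
import math
import re
from collections import Counter
from typing import List

def _bow(text: str) -> Counter:
    '''Bag-of-words vector; a stand-in for a real embedding model.'''
    return Counter(re.findall(r'[a-z0-9]+', text.lower()))

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_relevant(messages: List[dict], query: str, k: int = 2) -> List[dict]:
    '''Return the k messages most similar to the query, in original order.'''
    q = _bow(query)
    ranked = sorted(range(len(messages)),
                    key=lambda i: _cosine(_bow(messages[i]['content']), q),
                    reverse=True)
    keep = sorted(ranked[:k])  # restore chronological order
    return [messages[i] for i in keep]

history = [
    {'role': 'user', 'content': 'What is machine learning?'},
    {'role': 'assistant', 'content': 'Machine learning lets systems learn from data.'},
    {'role': 'user', 'content': 'Recommend a pasta recipe.'},
    {'role': 'assistant', 'content': 'Try spaghetti with garlic and olive oil.'},
]
for msg in select_relevant(history, 'How does machine learning use data?', k=2):
    print(msg['role'], '-', msg['content'])
```

Swapping `_bow` and `_cosine` for an embedding model and a vector index changes the quality of the matches but not the shape of the selection step.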

### Section 4: Evaluation Framework

Production evaluation requires:
- **Accuracy metrics**: Exact match, F1, BLEU
- **Semantic metrics**: BERTScore, embedding similarity
- **Task-specific**: Groundedness, hallucination detection
- **User metrics**: Satisfaction, task completion
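
Of the accuracy metrics listed, F1 is the one most often computed at token level for free-text answers, since it gives partial credit where exact match gives none. A minimal sketch follows, assuming whitespace tokenization and case-insensitive matching; `token_f1` is a hypothetical helper and not part of the evaluator class in this section.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    '''Token-level F1 between a prediction and a reference answer.'''
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1('the capital is Tokyo', 'Tokyo'))
```

Here precision is 1/4 and recall is 1/1, so the score is 0.4, while exact match would report a plain failure for the same pair.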

In [None]:
import pandas as pd
from typing import Any, Callable, List, Dict
import numpy as np

class ComprehensiveEvaluator:
    '''Multi-metric evaluation framework for LLM systems'''
    
    def __init__(self):
        self.test_cases = []
        self.results = []
    
    def add_test_case(self, 
                     name: str, 
                     input_data: str, 
                     expected_output: Any, 
                     category: str = 'general',
                     metadata: dict = None):
        '''Add test case with metadata'''
        self.test_cases.append({
            'name': name,
            'input': input_data,
            'expected': expected_output,
            'category': category,
            'metadata': metadata or {}
        })
    
    def run_evaluation(self, system_func: Callable) -> pd.DataFrame:
        '''Run all tests and collect metrics'''
        self.results = []
        
        import time  # imported once here rather than on every loop iteration
        
        for test in self.test_cases:
            start = time.time()
            
            try:
                actual = system_func(test['input'])
                latency_ms = (time.time() - start) * 1000
                
                # Calculate multiple metrics
                exact_match = str(actual) == str(test['expected'])
                semantic_sim = self._semantic_similarity(str(actual), str(test['expected']))
                
                self.results.append({
                    'name': test['name'],
                    'category': test['category'],
                    'exact_match': exact_match,
                    'semantic_similarity': semantic_sim,
                    'latency_ms': latency_ms,
                    'expected': test['expected'],
                    'actual': actual,
                    'status': 'success',
                    'error': None
                })
            
            except Exception as e:
                latency_ms = (time.time() - start) * 1000
                self.results.append({
                    'name': test['name'],
                    'category': test['category'],
                    'exact_match': False,
                    'semantic_similarity': 0.0,
                    'latency_ms': latency_ms,
                    'expected': test['expected'],
                    'actual': None,
                    'status': 'error',
                    'error': str(e)
                })
        
        return pd.DataFrame(self.results)
    
    def _semantic_similarity(self, text1: str, text2: str) -> float:
        '''Calculate semantic similarity (simplified)'''
        # In production, use sentence-transformers
        # For demo, use simple word overlap
        words1 = set(text1.lower().split())
        words2 = set(text2.lower().split())
        
        if not words1 or not words2:
            return 0.0
        
        intersection = words1 & words2
        union = words1 | words2
        
        return len(intersection) / len(union)  # Jaccard similarity
    
    def get_summary(self, by_category=True) -> dict:
        '''Get evaluation summary statistics'''
        if not self.results:
            return {}
        
        df = pd.DataFrame(self.results)
        
        summary = {
            'total_tests': len(df),
            'exact_match_rate': df['exact_match'].mean(),
            'avg_semantic_sim': df['semantic_similarity'].mean(),
            'avg_latency_ms': df['latency_ms'].mean(),
            'p95_latency_ms': df['latency_ms'].quantile(0.95),
            'error_rate': (df['status'] == 'error').mean(),
        }
        
        if by_category:
            category_stats = df.groupby('category').agg({
                'exact_match': 'mean',
                'semantic_similarity': 'mean',
                'latency_ms': 'mean',
            }).to_dict()
            summary['by_category'] = category_stats
        
        return summary
    
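    def get_failing_tests(self, threshold=0.7) -> List[dict]:
        '''Get tests whose semantic similarity falls below the threshold'''
        if not self.results:
            return []  # Nothing evaluated yet; avoids a KeyError on an empty frame
        df = pd.DataFrame(self.results)
        failing = df[df['semantic_similarity'] < threshold]
        return failing.to_dict('records')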
    
    def compare_runs(self, baseline_results: List[dict]) -> dict:
        '''Compare current run against baseline'''
        current_df = pd.DataFrame(self.results)
        baseline_df = pd.DataFrame(baseline_results)
        
        # Merge on test name
        merged = current_df.merge(baseline_df, on='name', suffixes=('_current', '_baseline'))
        
        comparison = {
            'accuracy_change': merged['exact_match_current'].mean() - merged['exact_match_baseline'].mean(),
            'latency_change_ms': merged['latency_ms_current'].mean() - merged['latency_ms_baseline'].mean(),
            'improved_tests': len(merged[merged['semantic_similarity_current'] > merged['semantic_similarity_baseline']]),
            'regressed_tests': len(merged[merged['semantic_similarity_current'] < merged['semantic_similarity_baseline']]),
        }
        
        return comparison

# Example: Evaluate Q&A system
def simple_qa_system(question: str) -> str:
    '''Mock Q&A system'''
    qa_pairs = {
        'What is the capital of France?': 'Paris',
        'What is 2+2?': '4',
        'Who invented Python?': 'Guido van Rossum',
        'What is machine learning?': 'ML is a subset of AI that learns from data',
    }
    
    # Substring match in either direction (a crude stand-in for fuzzy matching)
    for q, a in qa_pairs.items():
        if q.lower() in question.lower() or question.lower() in q.lower():
            return a
    
    return 'I don\'t know'

# Build evaluation suite
print('COMPREHENSIVE EVALUATION FRAMEWORK')
print('=' * 90)

evaluator = ComprehensiveEvaluator()

# Add diverse test cases
evaluator.add_test_case('geo_1', 'What is the capital of France?', 'Paris', 'geography')
evaluator.add_test_case('math_1', 'Calculate 2+2', '4', 'arithmetic')
evaluator.add_test_case('tech_1', 'Who invented Python?', 'Guido van Rossum', 'technology')
evaluator.add_test_case('ml_1', 'Explain machine learning', 'ML is a subset of AI', 'technical')
evaluator.add_test_case('unknown_1', 'What is the meaning of life?', 'I don\'t know', 'edge_case')

# Run evaluation
results_df = evaluator.run_evaluation(simple_qa_system)

print('\nTEST RESULTS:')
print(results_df[['name', 'category', 'exact_match', 'semantic_similarity', 'latency_ms']].to_string(index=False))

# Summary statistics
print('\n' + '=' * 90)
print('SUMMARY STATISTICS:')
summary = evaluator.get_summary()
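for key, value in summary.items():
    if key == 'by_category':
        continue
    if key.endswith('_ms'):
        # Latencies are durations, never percentages (even when below 1 ms)
        print(f'  {key}: {value:.3f} ms')
    elif isinstance(value, float):
        # Rates and similarities live in [0, 1], including exactly 1.0
        print(f'  {key}: {value:.2%}')
    else:
        print(f'  {key}: {value}')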

print('\nBy Category:')
for cat, metrics in summary.get('by_category', {}).get('exact_match', {}).items():
    print(f'  {cat}: {metrics:.1%} accuracy')

# Failing tests
print('\n' + '=' * 90)
print('FAILING TESTS (semantic_sim < 0.7):')
failing = evaluator.get_failing_tests(threshold=0.7)
for test in failing:
    print(f"  {test['name']}: expected '{test['expected']}', got '{test['actual']}'")

print('\n' + '=' * 90)
print('KEY METRICS FOR PRODUCTION:')
print('  - Accuracy: Core quality metric')
print('  - Latency P95: 95th percentile response time')
print('  - Cost: Tokens and dollars per request')
print('  - Regression detection: Compare against baseline')
print('  - Category breakdown: Identify weak areas')
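
Regression detection against a baseline, the fourth metric printed above, reduces to a per-test comparison of two runs. The sketch below uses only plain dictionaries so it runs without pandas; `compare_to_baseline`, the `tolerance` parameter and the sample scores are illustrative assumptions, not part of `ComprehensiveEvaluator`.

```python
def compare_to_baseline(baseline: dict, current: dict, tolerance: float = 0.02) -> dict:
    """Compare per-test scores (test name -> score in [0, 1]) between two runs."""
    shared = sorted(set(baseline) & set(current))  # Only tests present in both runs
    deltas = {name: current[name] - baseline[name] for name in shared}
    return {
        'compared': len(shared),
        # A test regresses or improves only if it moves by more than the tolerance
        'regressed': [n for n in shared if deltas[n] < -tolerance],
        'improved': [n for n in shared if deltas[n] > tolerance],
        'mean_change': sum(deltas.values()) / len(shared) if shared else 0.0,
    }

# Made-up scores for illustration; 'new_1' has no baseline and is ignored
baseline_scores = {'geo_1': 1.0, 'math_1': 1.0, 'ml_1': 0.5}
current_scores = {'geo_1': 1.0, 'math_1': 0.0, 'ml_1': 0.6, 'new_1': 1.0}

report = compare_to_baseline(baseline_scores, current_scores)
print('Regressed:', report['regressed'])
print('Improved:', report['improved'])
```

Restricting the comparison to shared test names keeps a newly added test from shifting the aggregate, which is the same effect the inner merge in `compare_runs` has.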

## Interview Questions: LangChain & Production Chains

### For Senior/Staff Engineers

In [None]:
langchain_interview_questions = [
    {
        'level': 'Senior',
        'question': 'Your chain has 3 sequential LLM calls. Latency is 4.5s (1.5s each). Users complain it\'s too slow. Optimize while maintaining quality.',
        'answer': '''
**Current Architecture (Sequential):**
```
Step 1: Extract entities → 1.5s
Step 2: Classify intent → 1.5s  
Step 3: Generate response → 1.5s
Total: 4.5s
```

**Optimization Strategies:**

**1. Parallel Execution (3.0s, 33% improvement)**
```python
import asyncio

class ParallelChain:
    async def run(self, input_text: str):
        # Run independent steps in parallel
        entities_task = asyncio.create_task(self.extract_entities(input_text))
        intent_task = asyncio.create_task(self.classify_intent(input_text))
        
        # Wait for both
        entities, intent = await asyncio.gather(entities_task, intent_task)
        
        # Final step depends on previous results
        response = await self.generate_response(input_text, entities, intent)
        
        return response

# Latency: max(1.5s, 1.5s) + 1.5s = 3.0s (33% improvement)
```

**2. Cached Intermediate Results (~1.8s average at a 60% hit rate, 60% improvement)**
```python
import hashlib
import json
import redis

class CachedChain:
    def __init__(self):
        self.redis = redis.Redis()
        self.ttl = 3600  # 1 hour
    
    def extract_entities(self, text: str):
        # Cache key from input hash
        cache_key = f'entities:{hashlib.md5(text.encode()).hexdigest()}'
        
        # Check cache
        cached = self.redis.get(cache_key)
        if cached:
            return json.loads(cached)
        
        # Call LLM
        result = llm_extract_entities(text)
        
        # Cache result
        self.redis.setex(cache_key, self.ttl, json.dumps(result))
        
        return result

# With a 60% cache hit rate on all three steps: 0.4 * 4.5s ≈ 1.8s average
```

**3. Prompt Optimization (3.0s, 33% improvement)**
```python
# Combine steps where possible

# Before (3 separate calls):
# 1. Extract entities
# 2. Classify intent  
# 3. Generate response

# After (2 calls):
# 1. Extract entities + classify intent (combined)
# 2. Generate response

def combined_extraction_and_classification(text: str):
    prompt = f"""
    Analyze this text and return JSON:
    {{
      "entities": {{"people": [...], "places": [...]}},
      "intent": "question|statement|request"
    }}
    
    Text: {text}
    """
    return llm(prompt)

# Latency: 1.5s + 1.5s = 3.0s (33% improvement)
```

**4. Streaming Responses (Perceived: <1s, Actual: same)**
```python
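from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def stream_response(prompt: str):
    """Stream tokens as they are generated (openai>=1.0 client)"""
    stream = client.chat.completions.create(
        model='gpt-4',
        messages=[{'role': 'user', 'content': prompt}],
        stream=True
    )
    for chunk in stream:
        # delta.content is None on role-only and final chunks
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content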

# User sees first token in ~200ms
# Much better UX even if total time is same
```

**5. Smaller Model for Simple Steps (2.5s, 45% improvement)**
```python
class HybridChain:
    """Use smaller/faster models for simple tasks"""
    
    def __init__(self):
        self.fast_model = 'gpt-3.5-turbo'  # ~0.5s
        self.smart_model = 'gpt-4'  # ~1.5s
    
    def extract_entities(self, text: str):
        # Simple extraction → fast model
        return llm(text, model=self.fast_model)  # 0.5s
    
    def classify_intent(self, text: str):
        # Simple classification → fast model
        return llm(text, model=self.fast_model)  # 0.5s
    
    def generate_response(self, text: str, entities, intent):
        # Complex generation → smart model
        return llm(text, model=self.smart_model)  # 1.5s

# Total: 0.5s + 0.5s + 1.5s = 2.5s (45% improvement)
# Cost reduction: ~70% (GPT-3.5 is 10x cheaper)
```

**6. Complete Solution (Combined Approach):**
```python
class OptimizedChain:
    """Production-optimized chain with multiple strategies"""
    
    def __init__(self):
        self.cache = redis.Redis()
        self.fast_model = 'gpt-3.5-turbo'
        self.smart_model = 'gpt-4'
    
    async def run(self, text: str):
        # Async generator: yields response chunks, so `return value` is not allowed
        # Check cache first (entire chain)
        cache_key = f'chain:{hashlib.md5(text.encode()).hexdigest()}'
        cached = self.cache.get(cache_key)
        if cached:
            yield json.loads(cached)['response']  # ~5ms
            return
        
        # Parallel execution of independent steps (fast model)
        entities_task = self.extract_entities_async(text)  # 0.5s
        intent_task = self.classify_intent_async(text)  # 0.5s
        
        entities, intent = await asyncio.gather(entities_task, intent_task)
        # Parallel time: max(0.5s, 0.5s) = 0.5s
        
        # Final step (smart model, streaming)
        response_stream = self.generate_response_stream(text, entities, intent)  # 1.5s
        
        # Start streaming to user immediately
        response = ''
        async for chunk in response_stream:
            response += chunk
            yield chunk  # Stream to user
        
        # Cache complete result
        result = {'entities': entities, 'intent': intent, 'response': response}
        self.cache.setex(cache_key, 3600, json.dumps(result))
        
        # Total latency: 0.5s + 1.5s = 2.0s (55% improvement)
        # With cache (60% hit rate): 0.6 * 0.005s + 0.4 * 2.0s = 0.8s average
        # User perceives: ~200ms (streaming)
```

**Results Summary:**
| Strategy | Latency | Improvement | Cost Impact |
|----------|---------|-------------|-------------|
| Baseline | 4.5s | - | $0.009 |
| Parallel | 3.0s | 33% | $0.009 |
| Cache (60% hit) | 1.8s | 60% | $0.004 |
| Hybrid models | 2.5s | 45% | $0.003 |
| **Combined** | **2.0s** | **55%** | **$0.003** |
| **+ Stream** | **0.2s perceived** | **95% perceived** | **$0.003** |

**Recommendation:**
- Implement combined solution
- Monitor cache hit rate (target: 60%+)
- Use streaming for better UX
- A/B test: measure user satisfaction vs latency
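
The latency figures in the summary table follow from expected-value arithmetic over cache hits and misses. A minimal sketch of that calculation, where the helper names `expected_latency` and `improvement` are illustrative and the timings are the assumed values used throughout this answer:

```python
def expected_latency(hit_rate: float, hit_latency_s: float, miss_latency_s: float) -> float:
    """Average latency when a fraction hit_rate of requests is served from cache."""
    return hit_rate * hit_latency_s + (1 - hit_rate) * miss_latency_s

def improvement(baseline_s: float, new_s: float) -> float:
    """Relative latency reduction versus the baseline (0.33 means 33% faster)."""
    return (baseline_s - new_s) / baseline_s

BASELINE_S = 4.5  # Three sequential 1.5s calls

parallel_s = max(1.5, 1.5) + 1.5    # Two independent steps overlap
combined_s = max(0.5, 0.5) + 1.5    # Fast model for the parallel steps
cached_s = expected_latency(0.6, 0.005, combined_s)

print(f'Parallel: {parallel_s:.1f}s ({improvement(BASELINE_S, parallel_s):.0%})')
print(f'Combined: {combined_s:.1f}s ({improvement(BASELINE_S, combined_s):.0%})')
print(f'Combined + cache: {cached_s:.2f}s average')
```

The cached average depends heavily on the hit rate, which is why the recommendation to monitor it matters: at a 20% hit rate the same formula gives roughly 1.6s instead of 0.8s.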
        ''',
    },
    {
        'level': 'Staff',
        'question': 'Design an evaluation system that runs continuously in production, catching regressions before they impact users. Include statistical significance testing.',
        'answer': '''
**Continuous Evaluation Architecture:**

**1. Shadow Traffic Evaluation**
```python
import asyncio
import time

class ShadowEvaluator:
    """Run new model versions on production traffic without affecting users"""
    
    def __init__(self, primary_model, shadow_model):
        self.primary = primary_model
        self.shadow = shadow_model
        self.metrics_store = MetricsStore()  # Placeholder metrics backend
        self._tasks = set()  # Keep references so pending tasks are not garbage-collected
    
    async def handle_request(self, request: dict) -> dict:
        # Serve from primary model
        primary_response = await self.primary.generate(request)
        
        # Shadow evaluation (async, doesn't block)
        task = asyncio.create_task(self._shadow_eval(request, primary_response))
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        
        return primary_response
    
    async def _shadow_eval(self, request: dict, primary_response: dict):
        """Evaluate shadow model in background"""
        try:
            # Generate shadow response
            shadow_response = await self.shadow.generate(request)
            
            # Compare metrics
            metrics = {
                'timestamp': time.time(),
                'primary_latency': primary_response['latency_ms'],
                'shadow_latency': shadow_response['latency_ms'],
                'primary_tokens': primary_response['tokens'],
                'shadow_tokens': shadow_response['tokens'],
                'semantic_similarity': self._semantic_sim(
                    primary_response['text'],
                    shadow_response['text']
                ),
            }
            
            self.metrics_store.record('shadow_eval', metrics)
            
        except Exception as e:
            self.metrics_store.record('shadow_error', {'error': str(e)})
```

**2. Golden Dataset Evaluation**
```python
class GoldenDatasetEvaluator:
    """Continuously evaluate on curated test set"""
    
    def __init__(self, golden_dataset_path: str, model=None):
        self.golden_dataset = self.load_golden_dataset(golden_dataset_path)
        self.model = model  # System under test; evaluate() calls self.model.generate()
        self.evaluation_interval = 3600  # 1 hour
    
    async def run_continuous_eval(self):
        """Run evaluation every hour"""
        while True:
            results = await self.evaluate()
            
            # Check for regressions
            if self.detect_regression(results):
                await self.alert_team(results)
            
            await asyncio.sleep(self.evaluation_interval)
    
    async def evaluate(self) -> dict:
        """Run evaluation on golden dataset"""
        results = []
        
        for test_case in self.golden_dataset:
            try:
                # Assumes generate() returns a dict with 'text', 'latency_ms' and 'tokens'
                prediction = await self.model.generate(test_case['input'])
                text = prediction['text']
                
                metrics = {
                    'test_id': test_case['id'],
                    'exact_match': text == test_case['expected'],
                    'bleu_score': self.calculate_bleu(text, test_case['expected']),
                    'semantic_sim': self.semantic_similarity(text, test_case['expected']),
                    'latency_ms': prediction['latency_ms'],
                    'tokens': prediction['tokens'],
                }
                
                results.append(metrics)
                
            except Exception as e:
                results.append({'test_id': test_case['id'], 'error': str(e)})
        
        return self.aggregate_results(results)
```

**3. Statistical Significance Testing**
```python
from typing import List

import numpy as np
import scipy.stats as stats

class StatisticalTester:
    """Test if changes are statistically significant"""
    
    def __init__(self, min_sample_size=100, alpha=0.05):
        self.min_sample_size = min_sample_size
        self.alpha = alpha  # Significance level
    
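    def compare_metrics(self, baseline: List[float], candidate: List[float]) -> dict:
        """Compare two model versions with statistical tests"""
        
        if len(baseline) < self.min_sample_size or len(candidate) < self.min_sample_size:
            # Include the keys callers read so this is treated as "no regression"
            return {
                'status': 'insufficient_data',
                'p_value': None,
                'is_statistically_significant': False,
                'relative_change': None,
            }
        
        # Welch's t-test for means (does not assume equal variances)
        t_statistic, p_value = stats.ttest_ind(baseline, candidate, equal_var=False)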
        
        # Effect size (Cohen's d)
        effect_size = self.cohens_d(baseline, candidate)
        
        # Determine significance
        is_significant = p_value < self.alpha
        
        # Practical significance (effect size)
        is_practically_significant = abs(effect_size) > 0.2  # Small effect
        
        return {
            'p_value': p_value,
            'is_statistically_significant': is_significant,
            'effect_size': effect_size,
            'is_practically_significant': is_practically_significant,
            'baseline_mean': np.mean(baseline),
            'candidate_mean': np.mean(candidate),
            'relative_change': (np.mean(candidate) - np.mean(baseline)) / np.mean(baseline),
        }
    
    def cohens_d(self, group1: List[float], group2: List[float]) -> float:
        """Calculate Cohen's d effect size"""
        n1, n2 = len(group1), len(group2)
        var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
        
        # Pooled standard deviation
        pooled_std = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
        
        # Cohen's d: positive when group1 (the baseline in compare_metrics) has the higher mean
        d = (np.mean(group1) - np.mean(group2)) / pooled_std
        
        return d
```

**4. Regression Detection**
```python
class RegressionDetector:
    """Detect regressions using multiple signals"""
    
    def __init__(self):
        self.baseline_metrics = self.load_baseline()
        self.stat_tester = StatisticalTester()
    
    def detect_regression(self, current_metrics: dict) -> tuple:
        """Return (has_regression, details) for the current metrics"""
        regressions = []
        
        for metric_name in ['accuracy', 'latency_p95', 'cost_per_query']:
            baseline = self.baseline_metrics[metric_name]
            current = current_metrics[metric_name]
            
            # Statistical test
            test_result = self.stat_tester.compare_metrics(baseline, current)
            
            # Check for regression
            if metric_name == 'accuracy':
                # Lower is bad
                if test_result['is_statistically_significant'] and test_result['relative_change'] < -0.02:
                    regressions.append({
                        'metric': metric_name,
                        'change': test_result['relative_change'],
                        'p_value': test_result['p_value'],
                    })
            
            elif metric_name in ['latency_p95', 'cost_per_query']:
                # Higher is bad
                if test_result['is_statistically_significant'] and test_result['relative_change'] > 0.1:
                    regressions.append({
                        'metric': metric_name,
                        'change': test_result['relative_change'],
                        'p_value': test_result['p_value'],
                    })
        
        return len(regressions) > 0, regressions
```

**5. Complete System**
```python
class ContinuousEvaluationSystem:
    """Production continuous evaluation system"""
    
    def __init__(self, primary_model, shadow_model):
        # Models are injected rather than read from undefined globals
        self.shadow_eval = ShadowEvaluator(primary_model, shadow_model)
        self.golden_eval = GoldenDatasetEvaluator('golden_dataset.json')
        self.regression_detector = RegressionDetector()
        self.alert_service = AlertService()
    
    async def start(self):
        """Start continuous evaluation"""
        # Run evaluations in parallel
        await asyncio.gather(
            self.golden_eval.run_continuous_eval(),
            self.monitor_shadow_traffic(),
            self.monitor_user_feedback()
        )
    
    async def monitor_shadow_traffic(self):
        """Monitor shadow evaluation results"""
        while True:
            # Aggregate last hour of shadow evals
            metrics = await self.shadow_eval.get_metrics(window_hours=1)
            
            # Detect regressions
            has_regression, details = self.regression_detector.detect_regression(metrics)
            
            if has_regression:
                await self.alert_service.send_alert(
                    severity='high',
                    title='Regression detected in shadow evaluation',
                    details=details
                )
            
            await asyncio.sleep(300)  # Check every 5 minutes
    
    async def monitor_user_feedback(self):
        """Monitor implicit user feedback"""
        while True:
            feedback_metrics = await self.collect_feedback_metrics()
            
            # Metrics: thumbs up/down, session abandonment, retry rate
            if feedback_metrics['satisfaction_score'] < 0.7:
                await self.alert_service.send_alert(
                    severity='medium',
                    title='User satisfaction drop detected',
                    details=feedback_metrics
                )
            
            await asyncio.sleep(600)  # Check every 10 minutes
```

**Key Metrics to Monitor:**
1. **Accuracy**: Exact match, BLEU, semantic similarity
2. **Latency**: P50, P95, P99
3. **Cost**: Tokens per query, $ per query
4. **User satisfaction**: Thumbs up/down rate, retry rate
5. **Error rate**: Parse failures, timeouts

**Alert Conditions:**
- Accuracy drop > 2% (p < 0.05)
- Latency P95 increase > 10% (p < 0.05)
- Cost increase > 15% (p < 0.05)
- User satisfaction < 70%
- Error rate > 5%
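
The alert conditions above can be expressed as a small rule table plus two checks. This is a sketch under stated assumptions: `ALERT_RULES`, `should_alert` and `absolute_alerts` are hypothetical names, the metric keys reuse those from `RegressionDetector`, and `relative_change` is taken to mean (candidate - baseline) / baseline.

```python
from typing import Optional

# metric -> (direction that counts as a regression, relative-change threshold)
ALERT_RULES = {
    'accuracy': ('down', 0.02),
    'latency_p95': ('up', 0.10),
    'cost_per_query': ('up', 0.15),
}

def should_alert(metric: str, relative_change: float,
                 p_value: Optional[float], alpha: float = 0.05) -> bool:
    """Alert only when a change is statistically significant and large enough."""
    direction, threshold = ALERT_RULES[metric]
    if p_value is None or p_value >= alpha:
        return False  # No test result, or not significant
    if direction == 'down':
        return relative_change < -threshold
    return relative_change > threshold

def absolute_alerts(satisfaction: float, error_rate: float) -> list:
    """Conditions checked against fixed limits rather than a baseline."""
    alerts = []
    if satisfaction < 0.70:
        alerts.append('user_satisfaction')
    if error_rate > 0.05:
        alerts.append('error_rate')
    return alerts

print(should_alert('accuracy', -0.03, p_value=0.01))
print(should_alert('cost_per_query', 0.12, p_value=0.01))
print(absolute_alerts(satisfaction=0.65, error_rate=0.01))
```

Keeping thresholds in one table makes it straightforward to give each metric its own limit, such as the 15% allowance for cost versus 10% for latency.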

**Best Practices:**
1. Use shadow traffic for pre-deployment testing
2. Maintain golden dataset (200-500 diverse cases)
3. Require statistical significance (p < 0.05) + practical significance (effect size > 0.2)
4. Monitor user feedback as ultimate signal
5. Automate rollback on critical regressions
        ''',
    },
]

for i, qa in enumerate(langchain_interview_questions, 1):
    print('\n' + '=' * 100)
    print(f'Q{i} [{qa["level"]} Level]')
    print('=' * 100)
    print(f'\n{qa["question"]}\n')
    print('ANSWER:')
    print(qa['answer'])
    print()