# Production Operations Demo

Demos aligned with Part 1B Section 4: Production Operations

1. **Observability Stack Decision** - Choosing the right monitoring tools
2. **Phoenix (Arize)** - Open source, self-hosted observability with RAG focus
3. **Langfuse Simulation** - Open source LLM tracing patterns (local demo)
4. **DeepEval** - LLM evaluation with pytest integration
5. **RAGAS** - RAG-specific evaluation metrics
6. **LLM-as-Judge** - Custom evaluation with local models
7. **Production Failure Modes** - Checklist and detection patterns

## Key Insight

Traditional APM tells you if your service is *up*. LLM observability tells you if your service is *good*.

## Architecture

```
Production LLM System
        │
        ▼
┌───────────────────┐     ┌───────────────────┐
│   Observability   │     │    Evaluation     │
│  (What happened)  │     │  (Was it good?)   │
├───────────────────┤     ├───────────────────┤
│ • Traces          │     │ • Faithfulness    │
│ • Token costs     │     │ • Relevancy       │
│ • Latency         │     │ • Hallucination   │
│ • User feedback   │     │ • LLM-as-Judge    │
└───────────────────┘     └───────────────────┘
         │                         │
         └──────────┬──────────────┘
                    ▼
            Quality Monitoring
```
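The "Token costs" signal in the diagram comes down to simple arithmetic once input and output token counts are logged per request. The sketch below is illustrative only: `TokenPricing` and `estimate_cost_usd` are hypothetical names, and the prices are placeholders rather than real rates for any provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPricing:
    """Hypothetical prices in USD per one million tokens (placeholder values)."""
    input_per_million: float
    output_per_million: float


def estimate_cost_usd(input_tokens: int, output_tokens: int, pricing: TokenPricing) -> float:
    """Cost input and output tokens separately, since they are usually priced differently."""
    input_cost = input_tokens / 1_000_000 * pricing.input_per_million
    output_cost = output_tokens / 1_000_000 * pricing.output_per_million
    return input_cost + output_cost


# Placeholder rates, chosen only to make the arithmetic easy to follow
pricing = TokenPricing(input_per_million=1.0, output_per_million=4.0)
cost = estimate_cost_usd(input_tokens=150, output_tokens=50, pricing=pricing)
print(f"${cost:.6f}")  # $0.000350
```

Logging this value as a span attribute next to latency makes cost regressions visible in the same place as performance regressions.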

## Setup

```bash
pip install arize-phoenix deepeval ragas sentence-transformers openinference-instrumentation
```

**Ollama:** `ollama pull qwen3:4b && ollama pull llama3.2:1b`


In [1]:
# Setup: Environment, logging, and Ollama client
import subprocess
import logging
import requests
import re
import time
import json
from typing import List, Dict, Any, Tuple, Optional
from dataclasses import dataclass, field
from datetime import datetime

# Color-coded logging
class ColoredFormatter(logging.Formatter):
    COLORS = {'DEBUG': '\033[90m', 'INFO': '\033[92m', 'WARNING': '\033[93m', 'ERROR': '\033[91m', 'RESET': '\033[0m'}
    def format(self, record):
        color = self.COLORS.get(record.levelname, self.COLORS['RESET'])
        # Prefix the formatted text instead of mutating record.msg, so other handlers see the original record
        return f"{color}[{record.levelname}]{self.COLORS['RESET']} {super().format(record)}"

logger = logging.getLogger("production_ops_demo")
logger.setLevel(logging.DEBUG)
if not logger.handlers:
    handler = logging.StreamHandler()
    handler.setFormatter(ColoredFormatter('%(message)s'))
    logger.addHandler(handler)

# Check Ollama
def check_ollama():
    try:
        result = subprocess.run(["ollama", "list"], capture_output=True, text=True, timeout=5)
        if result.returncode == 0:
            logger.info("Ollama is running")
            print(f"\nAvailable models:\n{result.stdout}")
            return True
    except Exception as e:
        logger.error(f"Ollama check failed: {e}")
    return False

ollama_ready = check_ollama()
OLLAMA_URL = "http://localhost:11434"
EVAL_MODEL = "qwen3:4b"    # For LLM-as-judge
WEAK_MODEL = "llama3.2:1b" # For basic generation

# Ollama helpers
def ollama_generate(prompt: str, model: str, temperature: float = 0.7) -> Tuple[str, float]:
    """Call Ollama's /api/generate endpoint and return (response_text, latency_ms)."""
    start = time.perf_counter()
    response = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False, "options": {"temperature": temperature}},
        timeout=120
    )
    # Surface HTTP errors instead of silently returning an empty string
    response.raise_for_status()
    latency_ms = (time.perf_counter() - start) * 1000
    return response.json().get("response", ""), latency_ms

def clean_response(text: str) -> str:
    return re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL).strip()

logger.info("Helper functions ready")


[INFO] Ollama is running
[INFO] Helper functions ready



Available models:
NAME           ID              SIZE      MODIFIED   
qwen3:4b       359d7dd4bcda    2.5 GB    2 days ago    
llama3.2:1b    baf6a787fdff    1.3 GB    2 days ago    
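The `clean_response` helper is there because reasoning models such as qwen3 can emit their reasoning inside `<think>` tags before the final answer. The sketch below repeats the helper's regex and shows its effect on a hand-written string. `clean_response_strict` is a hypothetical variant, not part of the notebook, that also drops a `<think>` block that was never closed (for example, when generation is cut off).

```python
import re


def clean_response(text: str) -> str:
    """Same regex as the setup cell: remove complete <think>...</think> blocks."""
    return re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL).strip()


def clean_response_strict(text: str) -> str:
    """Hypothetical variant: also remove an unterminated <think> block."""
    text = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL)
    text = re.sub(r'<think>.*', '', text, flags=re.DOTALL)
    return text.strip()


raw = "<think>\nThe user wants a label.\n</think>\n\ntechnical"
print(clean_response(raw))  # technical

truncated = "billing<think>Let me double-check"
print(clean_response_strict(truncated))  # billing
```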



---

# Demo 1: Observability Stack Decision Framework

LLM observability differs from traditional APM. Key dimensions:

| Dimension | Traditional APM | LLM Observability |
|-----------|-----------------|-------------------|
| Latency | Response time | TTFT, generation time, tool calls |
| Errors | HTTP codes | Hallucinations, refusals, toxicity |
| Costs | Compute/storage | Token economics (input/output) |
| Quality | N/A | User feedback, LLM-as-judge scores |
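To make the latency row concrete, the following sketch measures time to first token (TTFT) and total generation time over any iterable of text chunks. `measure_stream` and `fake_stream` are hypothetical names; the fake generator stands in for a streaming LLM response so the example runs without a model.

```python
import time
from typing import Any, Dict, Iterable, Iterator


def measure_stream(chunks: Iterable[str]) -> Dict[str, Any]:
    """Consume a chunk stream, recording time to first token and total time."""
    start = time.perf_counter()
    first_token_at = None
    pieces = []
    for chunk in chunks:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        pieces.append(chunk)
    end = time.perf_counter()
    ttft_ms = (first_token_at - start) * 1000 if first_token_at is not None else float("nan")
    return {
        "ttft_ms": ttft_ms,
        "total_ms": (end - start) * 1000,
        "num_chunks": len(pieces),
        "text": "".join(pieces),
    }


def fake_stream() -> Iterator[str]:
    """Stand-in for a streaming response: a delay, then a few chunks."""
    time.sleep(0.02)  # simulated prompt processing before the first token
    for piece in ["Hello", " ", "world"]:
        yield piece
        time.sleep(0.005)  # simulated per-token decoding time


stats = measure_stream(fake_stream())
print(f"TTFT: {stats['ttft_ms']:.1f} ms, total: {stats['total_ms']:.1f} ms")
```

A single response-time number hides the difference between a slow first token (queueing, long prompts) and slow decoding (long outputs), which is why the two are tracked separately.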


In [2]:
# Demo 1: Observability Decision Framework

print("="*65)
print("OBSERVABILITY: Stack Selection Decision Framework")
print("="*65)

OBSERVABILITY_DECISION = """
LLM Observability Stack Selection
==================================

DECISION TREE:

1. Are you using LangChain?
   YES → Start with LangSmith (zero-config integration)
   NO → Continue to #2

2. Do you need self-hosting (GDPR, data sovereignty)?
   YES → Langfuse (MIT license) or Phoenix (Elastic License 2.0)
   NO → Continue to #3

3. Do you have existing observability infrastructure?
   Datadog → Use Datadog LLM Monitoring (unified stack)
   New Relic → Use New Relic AI Monitoring
   Neither → Continue to #4

4. What's your primary use case?
   RAG/Retrieval → Phoenix by Arize (RAG-specific features)
   Agents → Langfuse or LangSmith (trace visualization)
   Cost tracking → Helicone (fastest setup)
   Evaluation focus → Braintrust (eval + observability)
"""

print(OBSERVABILITY_DECISION)

# Tool comparison matrix
tools = [
    {"name": "Langfuse", "deployment": "Cloud/Self-host", "best_for": "General OSS", "pricing": "Free tier", "license": "MIT", "eu_friendly": "✓"},
    {"name": "Phoenix", "deployment": "Self-host only", "best_for": "RAG, evals", "pricing": "Free (OSS)", "license": "Elastic License 2.0", "eu_friendly": "✓"},
    {"name": "LangSmith", "deployment": "Cloud", "best_for": "LangChain", "pricing": "Free tier", "license": "Proprietary", "eu_friendly": "~"},
    {"name": "Helicone", "deployment": "Cloud", "best_for": "Cost tracking", "pricing": "Free tier", "license": "Proprietary", "eu_friendly": "~"},
    {"name": "Opik", "deployment": "Cloud/Self-host", "best_for": "Speed", "pricing": "Free tier", "license": "Apache 2.0", "eu_friendly": "✓"},
    {"name": "Datadog", "deployment": "Cloud", "best_for": "Enterprise", "pricing": "Enterprise", "license": "Proprietary", "eu_friendly": "✓"},
]

print("\nTOOL COMPARISON:")
print("─"*75)
print(f"{'Tool':<12} {'Deployment':<15} {'Best For':<14} {'Pricing':<12} {'EU OK'}")
print("─"*75)
for t in tools:
    print(f"{t['name']:<12} {t['deployment']:<15} {t['best_for']:<14} {t['pricing']:<12} {t['eu_friendly']}")

print("\n" + "═"*65)
print("EU/GDPR CONSIDERATIONS")
print("═"*65)
print("""
For EU data sovereignty:

1. SELF-HOSTED (Full control):
   • Langfuse - MIT license, Docker/K8s, PostgreSQL backend
   • Phoenix - Elastic License 2.0, runs locally, no external deps
   • Opik - Apache 2.0, self-host option

2. EU-HOSTED CLOUD:
   • Langfuse Cloud EU (eu.cloud.langfuse.com)
   • Datadog EU (datadoghq.eu)

3. KEY REQUIREMENTS:
   • Data stays in EU region
   • No transatlantic data transfers
   • Audit logs for compliance
   • PII masking before logging
""")


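The decision tree printed by this cell can also be written as a small function, which makes the precedence of the questions explicit. This is a sketch: `choose_stack` and its parameter names are hypothetical, and the returned strings simply repeat the recommendations from the tree.

```python
from typing import Optional


def choose_stack(uses_langchain: bool,
                 needs_self_hosting: bool,
                 existing_infra: Optional[str] = None,
                 use_case: Optional[str] = None) -> str:
    """Walk the four questions of the decision tree in order."""
    # 1. LangChain users start with LangSmith
    if uses_langchain:
        return "LangSmith"
    # 2. Self-hosting requirement (GDPR, data sovereignty)
    if needs_self_hosting:
        return "Langfuse or Phoenix"
    # 3. Reuse existing observability infrastructure when present
    infra = {
        "datadog": "Datadog LLM Monitoring",
        "new relic": "New Relic AI Monitoring",
    }
    if existing_infra and existing_infra.lower() in infra:
        return infra[existing_infra.lower()]
    # 4. Otherwise pick by primary use case
    by_use_case = {
        "rag": "Phoenix",
        "agents": "Langfuse or LangSmith",
        "cost": "Helicone",
        "evaluation": "Braintrust",
    }
    return by_use_case.get((use_case or "").lower(), "No recommendation")


print(choose_stack(uses_langchain=False, needs_self_hosting=True))             # Langfuse or Phoenix
print(choose_stack(False, False, existing_infra="Datadog"))                    # Datadog LLM Monitoring
print(choose_stack(False, False, use_case="rag"))                              # Phoenix
```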

---

# Demo 2: Phoenix - Self-Hosted RAG Observability

Phoenix by Arize is a self-hostable observability tool (source available under the Elastic License 2.0) with:
- **Local execution** - No data leaves your machine
- **RAG-specific** - Retrieval quality metrics built-in
- **Trace visualization** - Multi-step LLM flow debugging
- **Evaluation** - LLM-as-judge integration
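Retrieval quality is commonly summarized with rank-aware metrics such as MRR and NDCG. The sketch below computes both from hand-labeled relevance judgments. It illustrates the definitions only and is not Phoenix's implementation; in particular, `ndcg_at_k` assumes every relevant document appears somewhere in the retrieved list, so the ideal ordering is derived from that same list.

```python
import math
from typing import Sequence


def mean_reciprocal_rank(ranked_relevance: Sequence[Sequence[int]]) -> float:
    """MRR over queries; each inner sequence flags results as relevant (1) or not (0), in rank order."""
    if not ranked_relevance:
        return 0.0
    total = 0.0
    for rels in ranked_relevance:
        for rank, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_relevance)


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """NDCG@k for one query, using graded or binary relevance in rank order."""
    def dcg(scores: Sequence[float]) -> float:
        return sum(score / math.log2(i + 2) for i, score in enumerate(scores))

    actual = dcg(relevances[:k])
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


# Two queries: first relevant hit at rank 2, then at rank 1
print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))  # 0.75
# Relevant documents at ranks 2 and 3 score lower than the ideal ordering
print(round(ndcg_at_k([0, 1, 1], k=3), 3))
```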


In [3]:
# Demo 2: Phoenix Setup and Core Concepts

print("="*65)
print("PHOENIX: Open Source LLM Observability")
print("="*65)

phoenix_available = False
try:
    import phoenix as px
    phoenix_available = True
    print("\n✓ Phoenix installed")
except ImportError:
    print("\n⚠ Phoenix not installed")
    print("  pip install arize-phoenix openinference-instrumentation")

print("""
PHOENIX ARCHITECTURE:
─────────────────────────────────────────────────────────────────

  ┌──────────────┐     ┌──────────────────────────────────────┐
  │  Your App    │────▶│         Phoenix Server               │
  │  (LLM calls) │     │  (localhost:6006)                    │
  └──────────────┘     ├──────────────────────────────────────┤
         │             │  • Trace Collection                  │
         │             │  • Span Visualization                │
    OpenTelemetry      │  • RAG Metrics (MRR, NDCG)          │
    Instrumentation    │  • LLM-as-Judge Evals               │
         │             │  • Prompt Playground                 │
         ▼             └──────────────────────────────────────┘
  ┌──────────────┐
  │   Traces     │     DATA STAYS LOCAL (GDPR-friendly)
  │   (OTLP)     │
  └──────────────┘

KEY FEATURES:
─────────────────────────────────────────────────────────────────

  1. TRACING
     • Auto-instrumentation for OpenAI, Anthropic, LangChain
     • Multi-step flow visualization
     • Latency breakdown per span

  2. RAG ANALYSIS
     • Retrieval precision/recall
     • Context relevance scoring
     • Query-document embeddings visualization

  3. EVALUATION
     • Built-in LLM-as-judge templates
     • Custom evaluation criteria
     • Experiment tracking

  4. PROMPT ENGINEERING
     • Prompt playground
     • A/B testing prompts
     • Version tracking
""")


PHOENIX: Open Source LLM Observability


✓ Phoenix installed


In [4]:
# Demo 2b: Phoenix Tracing Simulation (without full server)

print("="*65)
print("PHOENIX: Tracing Pattern Demo (Local Simulation)")
print("="*65)

# Simulate what Phoenix traces would look like
# In production, you'd use OpenTelemetry instrumentation

import uuid

@dataclass
class Span:
    """Simulated OpenTelemetry span for LLM observability."""
    name: str
    span_type: str  # e.g. "llm", "retrieval", "embedding", "tool", "chain"
    start_time: float
    end_time: float = 0.0
    attributes: Dict[str, Any] = field(default_factory=dict)
    events: List[Dict] = field(default_factory=list)
    parent_id: Optional[str] = None
    # Random IDs avoid collisions between spans created within the same millisecond
    span_id: str = field(default_factory=lambda: f"span_{uuid.uuid4().hex[:16]}")
    
    @property
    def duration_ms(self) -> float:
        return (self.end_time - self.start_time) * 1000

@dataclass
class Trace:
    """Collection of spans representing an LLM operation."""
    trace_id: str
    name: str = ""
    spans: List[Span] = field(default_factory=list)
    
    def add_span(self, span: Span):
        self.spans.append(span)
    
    def summary(self) -> Dict:
        return {
            "trace_id": self.trace_id,
            "total_spans": len(self.spans),
            # Spans in this demo run sequentially, so summing their durations approximates the total
            "total_duration_ms": sum(s.duration_ms for s in self.spans),
            # dict.fromkeys de-duplicates while keeping first-seen order
            "span_types": list(dict.fromkeys(s.span_type for s in self.spans))
        }

class LocalTracer:
    """Simulated tracer demonstrating Phoenix patterns."""
    
    def __init__(self):
        self.traces: List[Trace] = []
        self.current_trace: Optional[Trace] = None
    
    def start_trace(self, name: str) -> Trace:
        trace = Trace(trace_id=f"trace_{int(time.time()*1000)}", name=name)
        self.current_trace = trace
        self.traces.append(trace)
        return trace
    
    def start_span(self, name: str, span_type: str, attributes: Optional[Dict[str, Any]] = None) -> Span:
        span = Span(
            name=name,
            span_type=span_type,
            start_time=time.perf_counter(),
            attributes=attributes or {}
        )
        if self.current_trace:
            self.current_trace.add_span(span)
        return span
    
    def end_span(self, span: Span, output: Any = None):
        span.end_time = time.perf_counter()
        # Record falsy outputs (e.g. "" or []) too; only skip when no output was given
        if output is not None:
            span.attributes["output"] = str(output)[:200]

# Demo: Trace a RAG query
print("\n[Simulated RAG Trace]")
print("─"*65)

tracer = LocalTracer()

def traced_rag_query(query: str) -> Dict:
    """Simulate a RAG query with full tracing."""
    
    # Start trace
    trace = tracer.start_trace("rag_query")
    
    # Span 1: Embedding
    embed_span = tracer.start_span("embed_query", "embedding", {
        "model": "all-MiniLM-L6-v2",
        "input_text": query
    })
    time.sleep(0.01)  # Simulate embedding
    tracer.end_span(embed_span, output="[0.1, 0.2, ...384 dims]")
    
    # Span 2: Retrieval
    retrieve_span = tracer.start_span("retrieve_docs", "retrieval", {
        "top_k": 3,
        "index": "knowledge_base"
    })
    time.sleep(0.02)  # Simulate retrieval
    retrieved_docs = ["Doc 1: Python basics...", "Doc 2: Data types..."]
    retrieve_span.attributes["num_docs"] = len(retrieved_docs)
    tracer.end_span(retrieve_span, output=retrieved_docs)
    
    # Span 3: LLM Generation
    llm_span = tracer.start_span("llm_generate", "llm", {
        "model": WEAK_MODEL,
        "temperature": 0.7,
        "input_tokens": 150,
        "context_chunks": len(retrieved_docs)
    })
    
    if ollama_ready:
        context = "\n".join(retrieved_docs)
        prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer briefly:"
        response, latency = ollama_generate(prompt, WEAK_MODEL)
        response = clean_response(response)
        # Rough estimate only: about 1.3 tokens per whitespace-separated word
        llm_span.attributes["output_tokens"] = int(len(response.split()) * 1.3)
        llm_span.attributes["latency_ms"] = latency
    else:
        response = "[LLM response would appear here]"
        time.sleep(0.1)
    
    tracer.end_span(llm_span, output=response)
    
    return {
        "query": query,
        "response": response,
        "trace": trace.summary()
    }

# Execute traced query
result = traced_rag_query("What is Python?")

print(f"  Query: {result['query']}")
print(f"  Response: {result['response'][:100]}...")
print(f"\n  Trace Summary:")
for k, v in result['trace'].items():
    print(f"    {k}: {v}")

print("\n  Span Details:")
for span in tracer.traces[-1].spans:
    print(f"    [{span.span_type:10}] {span.name}: {span.duration_ms:.1f}ms")


PHOENIX: Tracing Pattern Demo (Local Simulation)

[Simulated RAG Trace]
─────────────────────────────────────────────────────────────────
  Query: What is Python?
  Response: Python is a high-level, interpreted programming language known for its simplicity, readability, and ...

  Trace Summary:
    trace_id: trace_1767393196344
    total_spans: 3
    total_duration_ms: 1855.29420900275
    span_types: ['embedding', 'retrieval', 'llm']

  Span Details:
    [embedding ] embed_query: 12.5ms
    [retrieval ] retrieve_docs: 25.0ms
    [llm       ] llm_generate: 1817.8ms
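The spans in this demo are flat: `parent_id` is declared but never set. Real tracing systems record a parent for each span so the UI can render a hierarchy. The sketch below shows one way to do that with a context manager and a stack of open spans. `NestedSpan` and `NestedTracer` are hypothetical names used only for illustration and are not part of Phoenix or OpenTelemetry.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class NestedSpan:
    name: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None
    start_time: float = 0.0
    end_time: float = 0.0

    @property
    def duration_ms(self) -> float:
        return (self.end_time - self.start_time) * 1000


class NestedTracer:
    """Tracks open spans on a stack so each new span knows its parent."""

    def __init__(self) -> None:
        self.spans: List[NestedSpan] = []
        self._stack: List[NestedSpan] = []

    @contextmanager
    def span(self, name: str) -> Iterator[NestedSpan]:
        parent = self._stack[-1] if self._stack else None
        current = NestedSpan(
            name=name,
            parent_id=parent.span_id if parent else None,
            start_time=time.perf_counter(),
        )
        self.spans.append(current)
        self._stack.append(current)
        try:
            yield current
        finally:
            # Always close the span, even if the wrapped code raises
            current.end_time = time.perf_counter()
            self._stack.pop()


nested = NestedTracer()
with nested.span("rag_query") as root:
    with nested.span("retrieve_docs"):
        time.sleep(0.01)  # simulated retrieval
    with nested.span("llm_generate"):
        time.sleep(0.02)  # simulated generation

for s in nested.spans:
    indent = "  " if s.parent_id else ""
    print(f"{indent}{s.name}: {s.duration_ms:.1f}ms")
```

With a hierarchy, the root span's duration already covers its children, so total latency should be read from the root rather than summed across all spans.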


In [5]:
# Demo 2c: Phoenix Setup Guide

print("="*65)
print("PHOENIX: Setup Guide")
print("="*65)

print("""
1. INSTALLATION:
─────────────────────────────────────────────────────────────────

   pip install arize-phoenix openinference-instrumentation-openai

2. START SERVER:
─────────────────────────────────────────────────────────────────

   # Terminal 1: Start Phoenix
   python -m phoenix.server.main serve
   
   # Opens at http://localhost:6006

3. INSTRUMENT YOUR APP:
─────────────────────────────────────────────────────────────────

   from phoenix.otel import register
   from openinference.instrumentation.openai import OpenAIInstrumentor
   
   # Connect to Phoenix
   tracer_provider = register(
       project_name="my-llm-app",
       endpoint="http://localhost:6006/v1/traces"
   )
   
   # Auto-instrument OpenAI calls
   OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
   
   # Now all OpenAI calls are automatically traced!
   from openai import OpenAI
   client = OpenAI()
   response = client.chat.completions.create(
       model="gpt-4o-mini",
       messages=[{"role": "user", "content": "Hello!"}]
   )

4. VIEW TRACES:
─────────────────────────────────────────────────────────────────

   Open http://localhost:6006 to see:
   • All traces with latency
   • Token usage per request
   • Span hierarchy (retrieval → LLM → etc.)
   • Error rates and types

5. RUN EVALUATIONS:
─────────────────────────────────────────────────────────────────

   from phoenix.evals import (
       HallucinationEvaluator,
       RelevanceEvaluator,
       run_evals
   )
   
   # Evaluate traces
   hallucination_eval = HallucinationEvaluator(model=eval_model)
   relevance_eval = RelevanceEvaluator(model=eval_model)
   
   results = run_evals(
       dataframe=traces_df,
       evaluators=[hallucination_eval, relevance_eval]
   )
""")


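Demo 1 lists "PII masking before logging" as a requirement, and span attributes such as `input_text` are exactly where user data ends up. The sketch below masks email addresses and phone-like numbers before attributes are attached to a span. `mask_pii` and `mask_attributes` are hypothetical helpers, and the two regexes are deliberately simple: they will miss many kinds of PII and are no substitute for a dedicated redaction tool.

```python
import re
from typing import Any, Dict

# Illustrative patterns only; real PII detection needs far broader coverage
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')
PHONE_RE = re.compile(r'\+?\d[\d\s().-]{7,}\d')


def mask_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def mask_attributes(attributes: Dict[str, Any]) -> Dict[str, Any]:
    """Mask string values in a span-attribute dict; leave other types untouched."""
    return {
        key: mask_pii(value) if isinstance(value, str) else value
        for key, value in attributes.items()
    }


attrs = {
    "input_text": "Contact jane.doe@example.com or +49 30 1234567 about order 42",
    "top_k": 3,
}
print(mask_attributes(attrs))
```

Masking at the point where attributes are built, rather than in the backend, means unmasked text never leaves the application process.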

---

# Demo 3: Langfuse Patterns (Local Simulation)

Langfuse is one of the most widely used open-source LLM observability platforms:
- **19K+ GitHub stars** (at the time of writing)
- **MIT license** - Full control
- **EU-hosted cloud** option (eu.cloud.langfuse.com)
- **Self-hosting** with Docker

We'll demonstrate the patterns locally without requiring cloud setup.
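The decorator pattern is the core idea: wrapping a function records its inputs, output, latency, and any error without changing the function body. The sketch below is a local stand-in for that idea. `observe_sim` and `OBSERVATIONS` are hypothetical names, not Langfuse's API; the real decorator also handles nesting, async functions, and export to a backend.

```python
import functools
import time
from typing import Any, Callable, Dict, List, Optional

OBSERVATIONS: List[Dict[str, Any]] = []  # stand-in for a tracing backend


def observe_sim(name: Optional[str] = None) -> Callable:
    """Hypothetical decorator that records one observation per call."""
    def decorator(func: Callable) -> Callable:
        span_name = name or func.__name__

        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            record: Dict[str, Any] = {
                "name": span_name,
                "input": {"args": args, "kwargs": kwargs},
            }
            try:
                result = func(*args, **kwargs)
                record["output"] = result
                return result
            except Exception as exc:
                record["error"] = repr(exc)
                raise
            finally:
                # Runs on success and failure, so every call is logged
                record["latency_ms"] = (time.perf_counter() - start) * 1000
                OBSERVATIONS.append(record)
        return wrapper
    return decorator


@observe_sim()
def classify(ticket: str) -> str:
    """Toy classifier standing in for an LLM call."""
    return "technical" if "crash" in ticket.lower() else "general"


label = classify("The app keeps crashing when I try to upload files")
print(label, OBSERVATIONS[-1]["name"])  # technical classify
```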


In [6]:
# Demo 3: Langfuse Pattern Simulation

print("="*65)
print("LANGFUSE: LLM Observability Patterns")
print("="*65)

# Simulate Langfuse's decorator-based tracing pattern
# In production, you'd use: from langfuse.decorators import observe

class LangfuseSimulator:
    """Simulates Langfuse tracing to demonstrate patterns."""
    
    def __init__(self):
        self.traces = []
        self.current_trace_id = None
        self.observations = []
    
    def trace(self, name: str, user_id: str = None, metadata: Dict = None):
        trace_id = f"trace_{datetime.now().strftime('%H%M%S%f')}"
        self.current_trace_id = trace_id
        trace_data = {
            "id": trace_id,
            "name": name,
            "user_id": user_id,
            "metadata": metadata or {},
            "start_time": datetime.now().isoformat(),
            "observations": []
        }
        self.traces.append(trace_data)
        return trace_data
    
    def generation(self, name: str, model: str, input_text: str, 
                   output: str = None, usage: Dict = None):
        observation = {
            "type": "generation",
            "name": name,
            "model": model,
            "input": input_text[:500],
            "output": output[:500] if output else None,
            "usage": usage or {},
            "timestamp": datetime.now().isoformat()
        }
        if self.traces:
            self.traces[-1]["observations"].append(observation)
        return observation
    
    def score(self, trace_id: str, name: str, value: float, comment: str = None):
        """Add evaluation score to a trace."""
        score_data = {
            "trace_id": trace_id,
            "name": name,
            "value": value,
            "comment": comment
        }
        for trace in self.traces:
            if trace["id"] == trace_id:
                if "scores" not in trace:
                    trace["scores"] = []
                trace["scores"].append(score_data)
        return score_data

# Demo the patterns
lf = LangfuseSimulator()

print("\n[Traced Support Ticket Processing]")
print("─"*65)

def process_support_ticket(ticket: str, customer_id: str):
    """Process support ticket with Langfuse-style tracing."""
    
    # Start trace
    trace = lf.trace(
        name="support_ticket",
        user_id=customer_id,
        metadata={"channel": "web", "priority": "normal"}
    )
    
    # Simulate LLM generation
    if ollama_ready:
        prompt = f"Classify this support ticket into: billing, technical, general.\n\nTicket: {ticket}\n\nCategory:"
        response, latency = ollama_generate(prompt, WEAK_MODEL, temperature=0.3)
        response = clean_response(response)
    else:
        response = "technical"
        latency = 50
    
    # Log generation
    # Note: whitespace word counts stand in for real token counts in this demo
    lf.generation(
        name="classify_ticket",
        model=WEAK_MODEL,
        input_text=ticket,
        output=response,
        usage={"input_tokens": len(ticket.split()), "output_tokens": len(response.split()), "latency_ms": latency}
    )
    
    # Add quality score
    # Note: 0.85 is a hard-coded placeholder; in practice it would come from a judge or user feedback
    lf.score(
        trace_id=trace["id"],
        name="classification_confidence",
        value=0.85,
        comment="High confidence classification"
    )
    
    return {"category": response.strip().lower(), "trace_id": trace["id"]}

# Process a ticket
result = process_support_ticket(
    "The app keeps crashing when I try to upload files",
    customer_id="cust_12345"
)

print(f"  Category: {result['category'][:50]}")
print(f"  Trace ID: {result['trace_id']}")

# Show trace data
print("\n  Trace Data (what Langfuse stores):")
trace_data = lf.traces[-1]
print(f"    Name: {trace_data['name']}")
print(f"    User ID: {trace_data['user_id']}")
print(f"    Observations: {len(trace_data['observations'])}")
if trace_data.get('scores'):
    print(f"    Scores: {trace_data['scores'][0]['name']} = {trace_data['scores'][0]['value']}")


LANGFUSE: LLM Observability Patterns

[Traced Support Ticket Processing]
─────────────────────────────────────────────────────────────────
  Category: i'd classify the support ticket as "technical". th
  Trace ID: trace_040318231536

  Trace Data (what Langfuse stores):
    Name: support_ticket
    User ID: cust_12345
    Observations: 1
    Scores: classification_confidence = 0.85
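The `Category` line in this output is a full sentence rather than one of the three labels, because small models often wrap the answer in extra text. A common remedy is to map the free-text reply back onto the allowed label set before storing it. The sketch below assumes that approach; `extract_category` is a hypothetical helper. It picks the earliest label mentioned as a whole word, so a reply like "not billing, but technical" would still be misread.

```python
import re
from typing import Optional, Sequence

CATEGORIES = ("billing", "technical", "general")


def extract_category(response: str, categories: Sequence[str] = CATEGORIES) -> Optional[str]:
    """Return the earliest-mentioned allowed label, or None if none appears."""
    text = response.lower()
    best = None  # (position, label) of the earliest match so far
    for label in categories:
        match = re.search(rf'\b{re.escape(label)}\b', text)
        if match and (best is None or match.start() < best[0]):
            best = (match.start(), label)
    return best[1] if best else None


print(extract_category('I\'d classify the support ticket as "technical".'))  # technical
print(extract_category("Sorry, I cannot help with that."))                   # None
```

Returning `None` instead of guessing lets the caller log a parse failure as its own signal, which is useful when monitoring refusals and format drift.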


In [7]:
# Demo 3b: Langfuse Setup Guide

print("="*65)
print("LANGFUSE: Setup Guide")
print("="*65)

print("""
1. CLOUD SETUP (Quickest):
─────────────────────────────────────────────────────────────────

   # Sign up at https://cloud.langfuse.com (or eu.cloud.langfuse.com for EU)
   # Get API keys from project settings
   
   export LANGFUSE_PUBLIC_KEY="pk-..."
   export LANGFUSE_SECRET_KEY="sk-..."
   export LANGFUSE_HOST="https://cloud.langfuse.com"  # or eu.cloud...

2. SELF-HOSTED (Data Sovereignty):
─────────────────────────────────────────────────────────────────

   # docker-compose.yml
   services:
     langfuse:
       image: langfuse/langfuse:latest
       ports:
         - "3000:3000"
       environment:
         - DATABASE_URL=postgresql://...
         - NEXTAUTH_SECRET=your-secret
         - SALT=your-salt
   
   # Then: docker-compose up -d

3. INTEGRATION OPTIONS:
─────────────────────────────────────────────────────────────────

   pip install langfuse   # import paths below follow the v2 Python SDK; v3 moved some of them
   
   # Option A: Decorators (cleanest)
   from langfuse.decorators import observe
   
   @observe()
   def my_llm_function(query: str):
       # Automatically traces inputs, outputs, latency
       return llm.complete(query)
   
   # Option B: OpenAI wrapper (drop-in replacement)
   from langfuse.openai import OpenAI
   client = OpenAI()  # Auto-traces all calls
   
   # Option C: LangChain callback
   from langfuse.callback import CallbackHandler
   handler = CallbackHandler()
   chain.invoke(..., config={"callbacks": [handler]})

4. ADDING SCORES:
─────────────────────────────────────────────────────────────────

   from langfuse import Langfuse
   
   langfuse = Langfuse()
   
   # Score a trace programmatically
   langfuse.score(
       trace_id="trace-xxx",
       name="quality",
       value=0.9,
       comment="User thumbs up"
   )
   
   # Or use LLM-as-judge in dashboard → Evaluation tab

5. KEY METRICS TO TRACK:
─────────────────────────────────────────────────────────────────

   • Latency (P50, P95, P99)
   • Token usage and costs
   • Error rates by type
   • Quality scores over time
   • User satisfaction (thumbs up/down)
""")


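The latency percentiles in the key-metrics list can be computed from logged request latencies with the standard library. A minimal sketch, assuming latencies are collected in milliseconds; `latency_percentiles` is a hypothetical helper and the sample data is synthetic.

```python
import statistics
from typing import Dict, Sequence


def latency_percentiles(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """Return P50, P95, and P99 using inclusive linear interpolation."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; index i-1 holds the i-th percentile
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Synthetic latencies of 1..101 ms, chosen so the percentiles are easy to check
samples = [float(x) for x in range(1, 102)]
print(latency_percentiles(samples))  # {'p50': 51.0, 'p95': 96.0, 'p99': 100.0}
```

Tail percentiles matter more than the mean for LLM services, because a small share of very long generations can dominate user-perceived latency.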

---

# Demo 4: DeepEval - LLM Evaluation Framework

DeepEval is an open-source evaluation framework with:
- **Pytest integration** - Run evals in CI/CD
- **Built-in metrics** - Faithfulness, relevancy, hallucination
- **Custom metrics** - G-Eval for any criteria
- **~80% agreement** with human judgment (a reported figure; it varies with the task and the judge model)

Key metrics:
- **Faithfulness**: Is response grounded in context?
- **Answer Relevancy**: Does it answer the question?
- **Contextual Precision**: Are retrieved docs relevant?
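Contextual precision is rank-sensitive: it rewards placing relevant documents near the top of the retrieved list, not merely including them. The sketch below shows that behaviour with an average-precision-style calculation over hand-labeled relevance flags. It illustrates the idea only; `average_precision` is a hypothetical helper, and in DeepEval the relevance of each document is judged by an LLM rather than supplied by hand.

```python
from typing import Sequence


def average_precision(relevance: Sequence[bool]) -> float:
    """Average of precision@k taken at each rank k that holds a relevant document."""
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


# Same two relevant documents, different positions in the ranking
top_heavy = average_precision([True, False, True])
bottom_heavy = average_precision([False, True, True])
print(round(top_heavy, 3), round(bottom_heavy, 3))  # 0.833 0.583
```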


In [8]:
# Demo 4: DeepEval Overview and Local Simulation

print("="*65)
print("DEEPEVAL: LLM Evaluation Framework")
print("="*65)

deepeval_available = False
try:
    import deepeval
    deepeval_available = True
    print("\n✓ DeepEval installed")
except ImportError:
    print("\n⚠ DeepEval not installed")
    print("  pip install deepeval")

print("""
DEEPEVAL ARCHITECTURE:
─────────────────────────────────────────────────────────────────

  ┌─────────────────┐
  │   Test Case     │
  │  ─────────────  │
  │  • input        │     ┌──────────────┐
  │  • actual_output│────▶│   Metrics    │────▶ Pass/Fail
  │  • context      │     │  (LLM-judge) │
  │  • expected     │     └──────────────┘
  └─────────────────┘

BUILT-IN METRICS:
─────────────────────────────────────────────────────────────────

  RETRIEVAL METRICS:
    • ContextualPrecisionMetric  - Are retrieved docs relevant?
    • ContextualRecallMetric     - Did we get all relevant docs?
    • ContextualRelevancyMetric  - How relevant is the context?
    
  GENERATION METRICS:
    • FaithfulnessMetric    - Is response grounded in context?
    • AnswerRelevancyMetric - Does it answer the question?
    • HallucinationMetric   - Did the model make things up?
    
  SAFETY METRICS:
    • ToxicityMetric  - Is the response toxic?
    • BiasMetric      - Does it show bias?
""")


DEEPEVAL: LLM Evaluation Framework

✓ DeepEval installed


In [9]:
# Demo 4b: DeepEval Pattern Simulation with Local LLM

print("="*65)
print("DEEPEVAL: Local Evaluation Demo (Ollama as Judge)")
print("="*65)

# Simulate DeepEval's evaluation pattern using Ollama as the judge

@dataclass
class EvalTestCase:
    """Simulates DeepEval's LLMTestCase."""
    input: str                        # User query
    actual_output: str                # Generated response
    retrieval_context: List[str]      # Retrieved documents
    expected_output: Optional[str] = None  # Ground truth (optional)

@dataclass
class EvalResult:
    """Result of a metric evaluation."""
    metric_name: str
    score: float
    passed: bool
    reason: str

class LocalEvaluator:
    """Simulates DeepEval metrics using local Ollama models."""
    
    def __init__(self, model: str = EVAL_MODEL, threshold: float = 0.7):
        self.model = model
        self.threshold = threshold
    
    def evaluate_faithfulness(self, test_case: EvalTestCase) -> EvalResult:
        """
        Faithfulness: Is the response grounded in the provided context?
        Score 0-1, higher = more faithful to source.
        """
        context_str = "\n".join(test_case.retrieval_context)
        
        prompt = f"""You are an evaluation judge. Rate the faithfulness of the response.
Faithfulness = Is every claim in the response supported by the context?

CONTEXT:
{context_str}

RESPONSE TO EVALUATE:
{test_case.actual_output}

Rate faithfulness from 0.0 (completely unfaithful/hallucinated) to 1.0 (fully grounded).
Output ONLY a JSON: {{"score": <number>, "reason": "<brief explanation>"}}"""
        
        if ollama_ready:
            raw_response, _ = ollama_generate(prompt, self.model, temperature=0.1)
            raw_response = clean_response(raw_response)
            try:
                # Extract the first flat JSON object from the response
                json_match = re.search(r'\{[^}]+\}', raw_response)
                if json_match:
                    result = json.loads(json_match.group())
                    score = float(result.get("score", 0.5))
                    reason = result.get("reason", "No reason provided")
                else:
                    score = 0.5
                    reason = "Could not parse response"
            except (ValueError, TypeError):
                # Malformed JSON or a non-numeric score: fall back to a neutral score
                score = 0.5
                reason = "Parse error"
        else:
            score = 0.8
            reason = "Simulated evaluation"
        
        return EvalResult(
            metric_name="faithfulness",
            score=score,
            passed=score >= self.threshold,
            reason=reason
        )
    
    def evaluate_relevancy(self, test_case: EvalTestCase) -> EvalResult:
        """
        Answer Relevancy: Does the response answer the question?
        """
        prompt = f"""You are an evaluation judge. Rate how well the response answers the question.

QUESTION:
{test_case.input}

RESPONSE:
{test_case.actual_output}

Rate relevancy from 0.0 (completely irrelevant) to 1.0 (perfectly answers the question).
Output ONLY a JSON: {{"score": <number>, "reason": "<brief explanation>"}}"""
        
        if ollama_ready:
            raw_response, _ = ollama_generate(prompt, self.model, temperature=0.1)
            raw_response = clean_response(raw_response)
            try:
                json_match = re.search(r'\{[^}]+\}', raw_response)
                if json_match:
                    result = json.loads(json_match.group())
                    score = float(result.get("score", 0.5))
                    reason = result.get("reason", "No reason provided")
                else:
                    score = 0.5
                    reason = "Could not parse response"
            except (ValueError, TypeError):
                score = 0.5
                reason = "Parse error"
        else:
            score = 0.9
            reason = "Simulated evaluation"
        
        return EvalResult(
            metric_name="answer_relevancy",
            score=score,
            passed=score >= self.threshold,
            reason=reason
        )

# Create test cases
print("\n[Creating Test Cases]")
print("─"*65)

test_cases = [
    EvalTestCase(
        input="What is the return policy?",
        actual_output="You can return items within 30 days for a full refund. Items must be unused and in original packaging.",
        retrieval_context=[
            "Return Policy: Items may be returned within 30 days of purchase.",
            "Refunds: Full refunds are issued for unused items in original packaging.",
            "Shipping: Free shipping on orders over $50."
        ]
    ),
    EvalTestCase(
        input="How do I contact support?",
        actual_output="You can reach our support team 24/7 via email at support@company.com or call 1-800-SUPPORT.",
        retrieval_context=[
            "Contact: Email us at help@company.com",
            "Hours: Monday-Friday 9am-5pm",
        ]
    ),
]

for i, tc in enumerate(test_cases):
    print(f"\n  Test Case {i+1}:")
    print(f"    Question: {tc.input}")
    print(f"    Response: {tc.actual_output[:60]}...")
    print(f"    Context chunks: {len(tc.retrieval_context)}")


DEEPEVAL: Local Evaluation Demo (Ollama as Judge)

[Creating Test Cases]
─────────────────────────────────────────────────────────────────

  Test Case 1:
    Question: What is the return policy?
    Response: You can return items within 30 days for a full refund. Items...
    Context chunks: 3

  Test Case 2:
    Question: How do I contact support?
    Response: You can reach our support team 24/7 via email at support@com...
    Context chunks: 2
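
The evaluator above pulls the judge's verdict out with the pattern `\{[^}]+\}`, which stops at the first closing brace. That is fine for flat objects, but it can break if the judge nests an object or writes a brace inside the reason string. A minimal sketch of a more tolerant parser follows. The helper names `extract_first_json_object` and `clamp_score` are hypothetical and not part of the notebook or of DeepEval; the sketch relies only on the standard library's `json.JSONDecoder.raw_decode`.

```python
import json
from typing import Any, Optional


def extract_first_json_object(text: str) -> Optional[dict]:
    """Return the first decodable JSON object found in text, or None.

    Hypothetical helper: tries every '{' as a possible start so that
    surrounding prose and nested braces do not break parsing.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None


def clamp_score(value: Any, default: float = 0.5) -> float:
    """Coerce a judge score to a float in [0, 1], falling back to default."""
    try:
        return min(1.0, max(0.0, float(value)))
    except (TypeError, ValueError):
        return default


raw = 'Sure! {"score": 1.7, "reason": "uses {braces}", "meta": {"k": 1}} Done.'
parsed = extract_first_json_object(raw)
score = clamp_score(parsed.get("score")) if parsed else 0.5
```

Clamping matters because small local judges sometimes answer on the wrong scale (for example 7 instead of 0.7). Whether to clamp or to reject such answers is a design choice this sketch does not settle.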


In [10]:
# Demo 4c: Run Evaluations

print("="*65)
print("DEEPEVAL: Running Evaluations")
print("="*65)

evaluator = LocalEvaluator(model=EVAL_MODEL, threshold=0.7)

print("\n[Evaluation Results]")
print("─"*65)

all_results = []
for i, tc in enumerate(test_cases):
    print(f"\n  Test Case {i+1}: {tc.input}")
    
    # Faithfulness
    faith_result = evaluator.evaluate_faithfulness(tc)
    all_results.append(faith_result)
    status = "✓ PASS" if faith_result.passed else "✗ FAIL"
    print(f"    Faithfulness:  {faith_result.score:.2f} [{status}]")
    print(f"      Reason: {faith_result.reason[:60]}...")
    
    # Relevancy
    rel_result = evaluator.evaluate_relevancy(tc)
    all_results.append(rel_result)
    status = "✓ PASS" if rel_result.passed else "✗ FAIL"
    print(f"    Relevancy:     {rel_result.score:.2f} [{status}]")
    print(f"      Reason: {rel_result.reason[:60]}...")

# Summary
print("\n" + "═"*65)
print("EVALUATION SUMMARY")
print("═"*65)
passed = sum(1 for r in all_results if r.passed)
total = len(all_results)
print(f"\n  Passed: {passed}/{total} ({passed/total*100:.0f}%)")
print(f"  Threshold: {evaluator.threshold}")

# Aggregate by metric
from collections import defaultdict
by_metric = defaultdict(list)
for r in all_results:
    by_metric[r.metric_name].append(r.score)

print("\n  Average Scores:")
for metric, scores in by_metric.items():
    avg = sum(scores) / len(scores)
    print(f"    {metric}: {avg:.2f}")


DEEPEVAL: Running Evaluations

[Evaluation Results]
─────────────────────────────────────────────────────────────────

  Test Case 1: What is the return policy?
    Faithfulness:  1.00 [✓ PASS]
      Reason: Response matches context exactly...
    Relevancy:     1.00 [✓ PASS]
      Reason: The response directly and completely answers the question by...

  Test Case 2: How do I contact support?
    Faithfulness:  0.00 [✗ FAIL]
      Reason: Response claims 24/7 availability, email support@company.com...
    Relevancy:     1.00 [✓ PASS]
      Reason: Directly provides clear contact methods (email and phone) fo...

═════════════════════════════════════════════════════════════════
EVALUATION SUMMARY
═════════════════════════════════════════════════════════════════

  Passed: 3/4 (75%)
  Threshold: 0.7

  Average Scores:
    faithfulness: 0.50
    answer_relevancy: 1.00
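
The summary above reports averages, but a CI pipeline needs a pass/fail decision. The sketch below shows one way to turn per-metric averages into a gate. It is an illustration rather than DeepEval's own mechanism: `MetricResult` is a hypothetical stand-in for the `EvalResult` records, `gate_violations` is an invented name, and the 0.7 floors simply reuse the demo threshold.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MetricResult:
    """Hypothetical stand-in for the EvalResult records produced above."""
    metric_name: str
    score: float


def gate_violations(results: List[MetricResult], min_avg: Dict[str, float]) -> List[str]:
    """Return one message per metric whose average score is below its floor."""
    by_metric = defaultdict(list)
    for r in results:
        by_metric[r.metric_name].append(r.score)

    violations = []
    for metric, floor in min_avg.items():
        scores = by_metric.get(metric)
        if not scores:
            violations.append(f"{metric}: no results")
            continue
        avg = sum(scores) / len(scores)
        if avg < floor:
            violations.append(f"{metric}: average {avg:.2f} below floor {floor:.2f}")
    return violations


# Scores copied from the demo run above
demo_results = [
    MetricResult("faithfulness", 1.0),
    MetricResult("answer_relevancy", 1.0),
    MetricResult("faithfulness", 0.0),
    MetricResult("answer_relevancy", 1.0),
]
violations = gate_violations(
    demo_results, {"faithfulness": 0.7, "answer_relevancy": 0.7}
)
```

Inside a pytest test, `assert not violations, violations` would fail the build and print the offending metrics.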


---

# Demo 5: RAGAS - RAG-Specific Evaluation

RAGAS (RAG Assessment) is designed specifically for evaluating RAG pipelines:
- **Component-level** - Retriever and generator metrics
- **End-to-end** - Full pipeline quality
- **Reference-free** - Faithfulness and answer relevancy work without ground truth (context recall needs a reference answer)

Core Metrics:

| Metric | Measures | Range |
|--------|----------|-------|
| Context Precision | Are retrieved docs relevant to question? | 0-1 |
| Context Recall | Did we retrieve all needed info? | 0-1 |
| Faithfulness | Is answer grounded in context? | 0-1 |
| Answer Relevancy | Does answer address the question? | 0-1 |


In [11]:
# Demo 5: RAGAS Overview

print("="*65)
print("RAGAS: RAG Assessment Framework")
print("="*65)

ragas_available = False
try:
    import ragas
    ragas_available = True
    print("\n✓ RAGAS installed")
except ImportError:
    print("\n⚠ RAGAS not installed")
    print("  pip install ragas")

print("""
RAGAS ARCHITECTURE:
─────────────────────────────────────────────────────────────────

  RAG Pipeline Output
         │
         ▼
  ┌─────────────────────────────────────────────────────┐
  │                    RAGAS Evaluation                  │
  ├─────────────────────────────────────────────────────┤
  │                                                      │
  │  RETRIEVER METRICS           GENERATOR METRICS       │
  │  ─────────────────           ─────────────────       │
  │  • Context Precision         • Faithfulness          │
  │  • Context Recall            • Answer Relevancy      │
  │  • Context Relevancy         • Answer Correctness    │
  │                                                      │
  │                 END-TO-END                           │
  │                 ──────────                           │
  │                 • Answer Similarity                  │
  │                 • Answer Semantic Similarity         │
  │                                                      │
  └─────────────────────────────────────────────────────┘
                         │
                         ▼
                  Quality Scores (0-1)

RAGAS vs DEEPEVAL:
─────────────────────────────────────────────────────────────────

  ┌─────────────────┬────────────────────┬────────────────────┐
  │ Feature         │ RAGAS              │ DeepEval           │
  ├─────────────────┼────────────────────┼────────────────────┤
  │ Focus           │ RAG-specific       │ General LLM        │
  │ Metrics         │ RAG-optimized      │ Broader coverage   │
  │ Integration     │ LangChain native   │ Pytest native      │
  │ CI/CD           │ Manual             │ Built-in           │
  │ Custom metrics  │ Limited            │ G-Eval support     │
  └─────────────────┴────────────────────┴────────────────────┘

WHEN TO USE RAGAS:
  ✓ Evaluating RAG pipelines specifically
  ✓ Need retrieval quality metrics
  ✓ LangChain-based applications
  ✓ Quick RAG quality assessment

WHEN TO USE DEEPEVAL:
  ✓ General LLM evaluation
  ✓ CI/CD integration needed
  ✓ Custom evaluation criteria
  ✓ Non-RAG use cases
""")


RAGAS: RAG Assessment Framework

✓ RAGAS installed

RAGAS ARCHITECTURE:
─────────────────────────────────────────────────────────────────

  RAG Pipeline Output
         │
         ▼
  ┌─────────────────────────────────────────────────────┐
  │                    RAGAS Evaluation                  │
  ├─────────────────────────────────────────────────────┤
  │                                                      │
  │  RETRIEVER METRICS           GENERATOR METRICS       │
  │  ─────────────────           ─────────────────       │
  │  • Context Precision         • Faithfulness          │
  │  • Context Recall            • Answer Relevancy      │
  │  • Context Relevancy         • Answer Correctness    │
  │                                                      │
  │                 END-TO-END                           │
  │                 ──────────                           │
  │                 • Answer Similarity                  │
  │                 • Answer Semantic Similarity    

In [12]:
# Demo 5b: RAGAS Metric Simulation with Local LLM

print("="*65)
print("RAGAS: Local Metric Simulation")
print("="*65)

class LocalRAGASEvaluator:
    """Simulates RAGAS metrics using local Ollama models."""
    
    def __init__(self, model: str = EVAL_MODEL):
        self.model = model
    
    def context_precision(self, question: str, contexts: List[str]) -> float:
        """
        Context Precision (simplified): What fraction of retrieved contexts are relevant?
        Measures retriever precision. RAGAS itself weights relevance by rank;
        this simulation uses a plain fraction.
        """
        prompt = f"""Evaluate if each context is relevant to answering the question.

QUESTION: {question}

CONTEXTS:
{chr(10).join(f'{i+1}. {c}' for i, c in enumerate(contexts))}

For each context, respond with 1 if relevant, 0 if not.
Output ONLY a JSON: {{"relevance": [1, 0, 1, ...]}}"""
        
        if ollama_ready:
            raw, _ = ollama_generate(prompt, self.model, temperature=0.1)
            raw = clean_response(raw)
            try:
                match = re.search(r'\{[^}]+\}', raw)
                if match:
                    result = json.loads(match.group())
                    relevance = result.get("relevance", [])
                    if relevance:
                        return sum(relevance) / len(relevance)
            except (ValueError, TypeError):
                pass
        return 0.7  # Default
    
    def faithfulness(self, answer: str, contexts: List[str]) -> float:
        """
        Faithfulness: Is the answer grounded in the contexts?
        """
        prompt = f"""Check if the answer is supported by the contexts.

CONTEXTS:
{chr(10).join(contexts)}

ANSWER: {answer}

Rate from 0.0 (no support) to 1.0 (fully supported).
Output ONLY: {{"score": <number>}}"""
        
        if ollama_ready:
            raw, _ = ollama_generate(prompt, self.model, temperature=0.1)
            raw = clean_response(raw)
            try:
                match = re.search(r'\{[^}]+\}', raw)
                if match:
                    return float(json.loads(match.group()).get("score", 0.5))
            except (ValueError, TypeError):
                pass
        return 0.8
    
    def answer_relevancy(self, question: str, answer: str) -> float:
        """
        Answer Relevancy: Does the answer address the question?
        """
        prompt = f"""Does this answer address the question?

QUESTION: {question}
ANSWER: {answer}

Rate from 0.0 (irrelevant) to 1.0 (perfectly relevant).
Output ONLY: {{"score": <number>}}"""
        
        if ollama_ready:
            raw, _ = ollama_generate(prompt, self.model, temperature=0.1)
            raw = clean_response(raw)
            try:
                match = re.search(r'\{[^}]+\}', raw)
                if match:
                    return float(json.loads(match.group()).get("score", 0.5))
            except (ValueError, TypeError):
                pass
        return 0.85
    
    def evaluate_rag(self, question: str, answer: str, contexts: List[str]) -> Dict[str, float]:
        """Run all RAGAS metrics on a single sample."""
        return {
            "context_precision": self.context_precision(question, contexts),
            "faithfulness": self.faithfulness(answer, contexts),
            "answer_relevancy": self.answer_relevancy(question, answer)
        }

# Demo
print("\n[RAGAS Evaluation Demo]")
print("─"*65)

ragas_eval = LocalRAGASEvaluator()

# Sample RAG output
sample = {
    "question": "What programming languages does the company use?",
    "answer": "The company primarily uses Python for backend development and TypeScript for frontend. They also use Go for microservices.",
    "contexts": [
        "Tech Stack: Backend is built with Python (FastAPI), frontend uses React with TypeScript.",
        "Our microservices are written in Go for performance-critical components.",
        "The database layer uses PostgreSQL with Redis for caching.",
        "Team uses Git for version control and GitHub Actions for CI/CD."
    ]
}

print(f"  Question: {sample['question']}")
print(f"  Answer: {sample['answer'][:70]}...")
print(f"  Contexts: {len(sample['contexts'])} documents")

print("\n  Running RAGAS metrics...")
scores = ragas_eval.evaluate_rag(
    sample["question"],
    sample["answer"],
    sample["contexts"]
)

print("\n  Results:")
for metric, score in scores.items():
    bar = "█" * int(score * 20) + "░" * (20 - int(score * 20))
    print(f"    {metric:20}: {score:.2f} [{bar}]")


RAGAS: Local Metric Simulation

[RAGAS Evaluation Demo]
─────────────────────────────────────────────────────────────────
  Question: What programming languages does the company use?
  Answer: The company primarily uses Python for backend development and TypeScri...
  Contexts: 4 documents

  Running RAGAS metrics...

  Results:
    context_precision   : 0.50 [██████████░░░░░░░░░░]
    faithfulness        : 1.00 [████████████████████]
    answer_relevancy    : 1.00 [████████████████████]
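
The simulated `context_precision` above is a plain fraction, so two relevant chunks out of four score 0.50 wherever they appear in the ranking. RAGAS documents context precision as rank-aware: precision@k is averaged over the positions that hold a relevant chunk, so relevant chunks near the top score higher. The sketch below compares the two on hand-written relevance lists. The function names are hypothetical and the code is an approximation of the documented formula, not RAGAS's implementation.

```python
from typing import List


def simple_precision(relevance: List[int]) -> float:
    """Plain fraction of relevant chunks, as in the simulation above."""
    return sum(relevance) / len(relevance) if relevance else 0.0


def rank_aware_precision(relevance: List[int]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant chunk."""
    hits = 0
    weighted = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            weighted += hits / k  # precision@k at a relevant position
    return weighted / hits if hits else 0.0


top_heavy = [1, 1, 0, 0]     # relevant chunks ranked first
bottom_heavy = [0, 0, 1, 1]  # relevant chunks ranked last

for name, rel in [("top_heavy", top_heavy), ("bottom_heavy", bottom_heavy)]:
    print(f"{name}: simple={simple_precision(rel):.2f} "
          f"rank_aware={rank_aware_precision(rel):.2f}")
```

Both lists get the same simple score, while the rank-aware score separates a retriever that surfaces relevant chunks first from one that buries them.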


---

# Demo 6: LLM-as-Judge Patterns

Strong LLM judges reach roughly 80% agreement with human evaluators, though the figure varies with the task, the judge model, and the rubric. Key patterns:

| Pattern | Use Case | Cost |
|---------|----------|------|
| G-Eval | Custom criteria | Medium |
| Pairwise | A/B comparison | Higher |
| Reference-based | Ground truth available | Medium |
| Reference-free | No ground truth | Lower |

**Best Practices:**
- Use GPT-3.5 with few-shot examples instead of GPT-4 where accuracy holds up (roughly 10× cheaper); confirm against a labeled sample first
- Binary/low-precision scales (0-3) work as well as 0-100
- Sample 5-10% of production traffic for ongoing evaluation
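
Sampling a fixed share of traffic is easiest to reason about when the decision is deterministic per request, so that retries and replays of the same trace land in the same bucket. The sketch below hashes a trace ID into the unit interval and compares it with the sample rate. The name `should_evaluate` and the use of a string trace ID are assumptions for illustration, not part of any tool mentioned here.

```python
import hashlib


def should_evaluate(trace_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministically select roughly sample_rate of traces for judging."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


selected = sum(should_evaluate(f"trace-{i}") for i in range(10_000))
print(f"Selected {selected} of 10000 traces")
```

Hash-based sampling needs no shared state between workers. The trade-off is that the selected set is fixed for a given rate, so a periodic salt change may be useful if the same traces should not always be the ones judged.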


In [13]:
# Demo 6: LLM-as-Judge Patterns

print("="*65)
print("LLM-AS-JUDGE: Evaluation Patterns")
print("="*65)

class LLMJudge:
    """Implements various LLM-as-Judge patterns using local Ollama."""
    
    def __init__(self, model: str = EVAL_MODEL):
        self.model = model
    
    def geval(self, response: str, criteria: str, evaluation_steps: List[str]) -> Dict:
        """
        G-Eval: Custom evaluation with explicit criteria.
        From: Liu et al., "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment"
        """
        steps_text = "\n".join(f"{i+1}. {step}" for i, step in enumerate(evaluation_steps))
        
        prompt = f"""You are an evaluation judge. Evaluate the response based on the given criteria.

CRITERIA: {criteria}

EVALUATION STEPS:
{steps_text}

RESPONSE TO EVALUATE:
{response}

Based on the criteria and steps, rate from 1 (poor) to 5 (excellent).
Output ONLY: {{"score": <1-5>, "reasoning": "<brief explanation>"}}"""
        
        if ollama_ready:
            raw, _ = ollama_generate(prompt, self.model, temperature=0.1)
            raw = clean_response(raw)
            try:
                match = re.search(r'\{[^}]+\}', raw)
                if match:
                    result = json.loads(match.group())
                    return {
                        "score": result.get("score", 3),
                        "normalized_score": result.get("score", 3) / 5,
                        "reasoning": result.get("reasoning", "")
                    }
            except (ValueError, TypeError):
                pass
        return {"score": 3, "normalized_score": 0.6, "reasoning": "Default"}
    
    def pairwise(self, query: str, response_a: str, response_b: str, criteria: str) -> Dict:
        """
        Pairwise comparison: Which response is better?
        Often more reliable than absolute scoring, but sensitive to response order.
        """
        prompt = f"""Compare two responses and decide which is better.

QUERY: {query}

CRITERIA: {criteria}

RESPONSE A:
{response_a}

RESPONSE B:
{response_b}

Which response is better based on the criteria?
Output ONLY: {{"winner": "A" or "B", "reasoning": "<brief explanation>"}}"""
        
        if ollama_ready:
            raw, _ = ollama_generate(prompt, self.model, temperature=0.1)
            raw = clean_response(raw)
            try:
                match = re.search(r'\{[^}]+\}', raw)
                if match:
                    result = json.loads(match.group())
                    return {
                        "winner": result.get("winner", "A"),
                        "reasoning": result.get("reasoning", "")
                    }
            except (ValueError, TypeError):
                pass
        return {"winner": "A", "reasoning": "Default"}
    
    def reference_based(self, response: str, reference: str) -> Dict:
        """
        Reference-based: Compare response to ground truth.
        """
        prompt = f"""Compare the response to the reference answer.

REFERENCE (Ground Truth):
{reference}

RESPONSE TO EVALUATE:
{response}

How well does the response match the reference in terms of correctness and completeness?
Rate from 0.0 (completely wrong) to 1.0 (matches perfectly).
Output ONLY: {{"score": <number>, "issues": "<any issues found>"}}"""
        
        if ollama_ready:
            raw, _ = ollama_generate(prompt, self.model, temperature=0.1)
            raw = clean_response(raw)
            try:
                match = re.search(r'\{[^}]+\}', raw)
                if match:
                    result = json.loads(match.group())
                    return {
                        "score": float(result.get("score", 0.5)),
                        "issues": result.get("issues", "")
                    }
            except (ValueError, TypeError):
                pass
        return {"score": 0.75, "issues": "Default"}

# Demo the patterns
print("\n[1. G-Eval: Custom Criteria]")
print("─"*65)

judge = LLMJudge()

geval_result = judge.geval(
    response="Thank you for contacting us! I understand your frustration with the delayed order. I've checked our system and see that your package was shipped yesterday. You should receive it within 2-3 business days. Is there anything else I can help with?",
    criteria="Professional and empathetic customer service response",
    evaluation_steps=[
        "Check if the response acknowledges the customer's concern",
        "Verify the response provides a concrete solution or next steps",
        "Assess the tone for professionalism and empathy",
        "Check for unnecessary jargon or overly formal language"
    ]
)

print(f"  Score: {geval_result['score']}/5 ({geval_result['normalized_score']:.0%})")
print(f"  Reasoning: {geval_result['reasoning'][:80]}...")


LLM-AS-JUDGE: Evaluation Patterns

[1. G-Eval: Custom Criteria]
─────────────────────────────────────────────────────────────────
  Score: 5/5 (100%)
  Reasoning: Acknowledges customer's frustration, provides specific delivery timeline (shippe...


In [14]:
# Demo 6b: More LLM-as-Judge Patterns

print("="*65)
print("LLM-AS-JUDGE: Pairwise and Reference-Based")
print("="*65)

# Pairwise comparison
print("\n[2. Pairwise Comparison]")
print("─"*65)

pairwise_result = judge.pairwise(
    query="How do I reset my password?",
    response_a="Go to settings and click reset password.",
    response_b="To reset your password: 1) Click your profile icon, 2) Select 'Settings', 3) Click 'Security', 4) Click 'Reset Password', 5) Check your email for the reset link. The link expires in 24 hours.",
    criteria="Completeness and helpfulness"
)

print(f"  Query: How do I reset my password?")
print(f"  Response A: Go to settings and click reset password.")
print(f"  Response B: To reset your password: 1) Click your profile icon...")
print(f"\n  Winner: Response {pairwise_result['winner']}")
print(f"  Reasoning: {pairwise_result['reasoning'][:80]}...")

# Reference-based comparison
print("\n[3. Reference-Based Evaluation]")
print("─"*65)

ref_result = judge.reference_based(
    response="Python was created by Guido van Rossum in 1991. It's a high-level programming language.",
    reference="Python was created by Guido van Rossum and first released in 1991. It is a high-level, interpreted programming language known for its readability and versatility."
)

print(f"  Reference: Python was created by Guido van Rossum and first released in 1991...")
print(f"  Response: Python was created by Guido van Rossum in 1991...")
print(f"\n  Score: {ref_result['score']:.2f}")
print(f"  Issues: {ref_result['issues'][:80]}..." if ref_result['issues'] else "  Issues: None")

# Summary
print("\n" + "═"*65)
print("LLM-AS-JUDGE BEST PRACTICES")
print("═"*65)
print("""
  1. COST OPTIMIZATION:
     • Use smaller models (GPT-3.5, local Llama) with few-shot examples
     • Binary/low-precision scales reduce token cost
     • Cache judge responses for similar inputs
  
  2. RELIABILITY:
     • Pairwise comparison > absolute scoring
     • Use chain-of-thought for complex criteria
     • Sample multiple judgments and average
  
  3. PRODUCTION INTEGRATION:
     • Sample 5-10% of traffic for evaluation
     • Track judge agreement over time
     • Calibrate against human labels periodically
  
  4. COMMON METRICS:
     • Helpfulness (1-5)
     • Harmlessness (pass/fail)
     • Honesty (0-1)
     • Task completion (pass/fail)
""")


LLM-AS-JUDGE: Pairwise and Reference-Based

[2. Pairwise Comparison]
─────────────────────────────────────────────────────────────────
  Query: How do I reset my password?
  Response A: Go to settings and click reset password.
  Response B: To reset your password: 1) Click your profile icon...

  Winner: Response B
  Reasoning: Response B provides a detailed, step-by-step guide with specific actions and cri...

[3. Reference-Based Evaluation]
─────────────────────────────────────────────────────────────────
  Reference: Python was created by Guido van Rossum and first released in 1991...
  Response: Python was created by Guido van Rossum in 1991...

  Score: 0.70
  Issues: Missing 'interpreted' and the known characteristics (readability and versatility...

═════════════════════════════════════════════════════════════════
LLM-AS-JUDGE BEST PRACTICES
═════════════════════════════════════════════════════════════════

  1. COST OPTIMIZATION:
     • Use smaller models (GPT-3.5, local Llama)
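
Pairwise judging is sensitive to the order in which the two responses are shown, since LLM judges can favor a position regardless of content. A common mitigation is to ask twice with the order swapped and accept a winner only when both verdicts agree. The sketch below assumes a hypothetical `judge_fn(query, first, second)` callable that returns `"A"` for the first response or `"B"` for the second; it could wrap `LLMJudge.pairwise`, but the stub judges here exist only to make the example runnable offline.

```python
from typing import Callable

JudgeFn = Callable[[str, str, str], str]  # returns "A" (first) or "B" (second)


def debiased_pairwise(judge_fn: JudgeFn, query: str,
                      response_a: str, response_b: str) -> str:
    """Return "A", "B", or "tie" after judging in both presentation orders."""
    first = judge_fn(query, response_a, response_b)
    swapped = judge_fn(query, response_b, response_a)
    # Translate the swapped verdict back to the original labels
    swapped_mapped = "B" if swapped == "A" else "A"
    return first if first == swapped_mapped else "tie"


def always_first(query: str, first: str, second: str) -> str:
    """Stub judge with pure position bias."""
    return "A"


def prefers_longer(query: str, first: str, second: str) -> str:
    """Stub judge that picks the longer response, wherever it is shown."""
    return "A" if len(first) >= len(second) else "B"


short = "Go to settings and click reset password."
detailed = "Click your profile icon, open Settings, then Security, then Reset Password."

biased_verdict = debiased_pairwise(always_first, "How do I reset?", short, detailed)
stable_verdict = debiased_pairwise(prefers_longer, "How do I reset?", short, detailed)
```

This doubles the judge cost per comparison, which matches the "Higher" cost noted for the pairwise pattern. Tracking the tie rate over time also gives a rough signal of how position-sensitive the judge is.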

---

# Demo 7: Production Failure Mode Checklist

Common LLM failure modes and how to detect/mitigate them:

| Failure Mode | Symptom | Detection | Fix |
|-------------|---------|-----------|-----|
| Prompt Drift | Quality degrades over time | Monitor quality scores | Pin model versions |
| Context Overflow | Ignores important context | Track context length | Better chunking |
| Cost Explosion | Bills higher than expected | Budget alerts | Token limits |
| Hallucination Spike | Wrong answers | Faithfulness metric | Improve retrieval |
| Latency Regression | Slow responses | P95 monitoring | Timeouts, caching |


In [15]:
# Demo 7: Production Failure Mode Checklist

print("="*65)
print("PRODUCTION: Failure Mode Checklist")
print("="*65)

FAILURE_CHECKLIST = """
LLM System Failure Mode Checklist
==================================

PRE-DEPLOYMENT:
☐ Model validated on YOUR data (not just public benchmarks)
☐ Structured output tested with edge cases
☐ Guardrails configured and tested (jailbreak, PII, toxicity)
☐ Hallucination baseline measured
☐ Cost projections validated with realistic traffic estimates
☐ Latency tested under load

MONITORING (Day 1):
☐ Observability deployed (traces, tokens, costs)
☐ Alerts configured (error rate, latency P95, cost spikes)
☐ Evaluation pipeline running (5% sample with LLM-as-judge)
☐ User feedback collection enabled

ONGOING:
☐ Weekly: Review quality scores, cost trends
☐ Monthly: Re-evaluate model selection (new models may be better/cheaper)
☐ Quarterly: Refresh evaluation dataset with production examples
☐ Ad-hoc: Investigate quality degradation signals
"""

print(FAILURE_CHECKLIST)

# Common failure modes with detection
print("═"*65)
print("COMMON FAILURE MODES")
print("═"*65)

failure_modes = [
    {
        "name": "PROMPT DRIFT",
        "symptom": "Quality degrades over time without code changes",
        "cause": "Model updates by provider, data distribution shift",
        "detection": "Track quality scores weekly, compare to baseline",
        "fix": "Pin model versions, monitor quality metrics, A/B test updates"
    },
    {
        "name": "CONTEXT OVERFLOW",
        "symptom": "Responses ignore important context",
        "cause": "Exceeded context window, 'lost in the middle' effect",
        "detection": "Track avg context length, test with long inputs",
        "fix": "Better chunking, reranking, hierarchical summarization"
    },
    {
        "name": "COST EXPLOSION",
        "symptom": "Bills much higher than projected",
        "cause": "Verbose prompts, chatty responses, missing caching",
        "detection": "Daily cost monitoring, per-request cost tracking",
        "fix": "Audit token usage, implement output length limits, add routing"
    },
    {
        "name": "HALLUCINATION SPIKE",
        "symptom": "Users report factually wrong answers",
        "cause": "Poor retrieval quality, model uncertainty",
        "detection": "Faithfulness metric, user feedback analysis",
        "fix": "Improve retrieval, add confidence thresholds, HaluGate"
    },
    {
        "name": "LATENCY REGRESSION",
        "symptom": "Response times increase",
        "cause": "Larger context, provider issues, cold starts",
        "detection": "P95 latency monitoring, TTFT tracking",
        "fix": "Implement timeouts, add caching, optimize context"
    },
    {
        "name": "GUARDRAIL BYPASS",
        "symptom": "Harmful/off-topic responses get through",
        "cause": "New attack patterns, incomplete rules",
        "detection": "Safety metric sampling, user reports",
        "fix": "Red team regularly, update guardrails, add NeMo Guardrails"
    },
]

for fm in failure_modes:
    print(f"\n{fm['name']}")
    print("─"*40)
    print(f"  Symptom:   {fm['symptom']}")
    print(f"  Cause:     {fm['cause']}")
    print(f"  Detection: {fm['detection']}")
    print(f"  Fix:       {fm['fix']}")


PRODUCTION: Failure Mode Checklist

LLM System Failure Mode Checklist

PRE-DEPLOYMENT:
☐ Model validated on YOUR data (not just public benchmarks)
☐ Structured output tested with edge cases
☐ Guardrails configured and tested (jailbreak, PII, toxicity)
☐ Hallucination baseline measured
☐ Cost projections validated with realistic traffic estimates
☐ Latency tested under load

MONITORING (Day 1):
☐ Observability deployed (traces, tokens, costs)
☐ Alerts configured (error rate, latency P95, cost spikes)
☐ Evaluation pipeline running (5% sample with LLM-as-judge)
☐ User feedback collection enabled

ONGOING:
☐ Weekly: Review quality scores, cost trends
☐ Monthly: Re-evaluate model selection (new models may be better/cheaper)
☐ Quarterly: Refresh evaluation dataset with production examples
☐ Ad-hoc: Investigate quality degradation signals

═════════════════════════════════════════════════════════════════
COMMON FAILURE MODES
═════════════════════════════════════════════════════════════════

P

In [16]:
# Demo 7b: Automated Failure Detection System

print("="*65)
print("PRODUCTION: Automated Failure Detection")
print("="*65)

@dataclass
class HealthMetrics:
    """Production health metrics for LLM system."""
    latency_p50_ms: float
    latency_p95_ms: float
    error_rate: float
    avg_input_tokens: int
    avg_output_tokens: int
    daily_cost: float
    faithfulness_score: float
    relevancy_score: float
    user_satisfaction: float  # 0-1

class FailureDetector:
    """Detects common LLM failure modes from metrics."""
    
    def __init__(self, baseline: HealthMetrics):
        self.baseline = baseline
        self.alerts = []
    
    def check_latency_regression(self, current: HealthMetrics) -> Optional[str]:
        if current.latency_p95_ms > self.baseline.latency_p95_ms * 1.5:
            return f"⚠ LATENCY REGRESSION: P95 {current.latency_p95_ms:.0f}ms vs baseline {self.baseline.latency_p95_ms:.0f}ms"
        return None
    
    def check_cost_explosion(self, current: HealthMetrics) -> Optional[str]:
        if current.daily_cost > self.baseline.daily_cost * 1.3:
            return f"⚠ COST SPIKE: €{current.daily_cost:.2f}/day vs baseline €{self.baseline.daily_cost:.2f}/day"
        return None
    
    def check_quality_degradation(self, current: HealthMetrics) -> Optional[str]:
        if current.faithfulness_score < self.baseline.faithfulness_score * 0.9:
            return f"⚠ QUALITY DROP: Faithfulness {current.faithfulness_score:.2f} vs baseline {self.baseline.faithfulness_score:.2f}"
        return None
    
    def check_context_bloat(self, current: HealthMetrics) -> Optional[str]:
        if current.avg_input_tokens > self.baseline.avg_input_tokens * 1.4:
            return f"⚠ CONTEXT BLOAT: {current.avg_input_tokens} tokens vs baseline {self.baseline.avg_input_tokens}"
        return None
    
    def check_all(self, current: HealthMetrics) -> List[str]:
        alerts = []
        checks = [
            self.check_latency_regression,
            self.check_cost_explosion,
            self.check_quality_degradation,
            self.check_context_bloat,
        ]
        for check in checks:
            result = check(current)
            if result:
                alerts.append(result)
        return alerts

# Simulate baseline and current metrics
baseline = HealthMetrics(
    latency_p50_ms=150,
    latency_p95_ms=400,
    error_rate=0.02,
    avg_input_tokens=800,
    avg_output_tokens=200,
    daily_cost=50.0,
    faithfulness_score=0.85,
    relevancy_score=0.90,
    user_satisfaction=0.78
)

current = HealthMetrics(
    latency_p50_ms=220,      # Increased
    latency_p95_ms=650,      # Regression!
    error_rate=0.025,
    avg_input_tokens=1100,   # Up 37.5%, just below the 40% alert threshold
    avg_output_tokens=250,
    daily_cost=72.0,         # Spike!
    faithfulness_score=0.82,
    relevancy_score=0.88,
    user_satisfaction=0.72
)

print("\n[Baseline Metrics]")
print("─"*65)
print(f"  Latency P95:    {baseline.latency_p95_ms:.0f}ms")
print(f"  Daily cost:     €{baseline.daily_cost:.2f}")
print(f"  Avg tokens in:  {baseline.avg_input_tokens}")
print(f"  Faithfulness:   {baseline.faithfulness_score:.2f}")

print("\n[Current Metrics]")
print("─"*65)
print(f"  Latency P95:    {current.latency_p95_ms:.0f}ms")
print(f"  Daily cost:     €{current.daily_cost:.2f}")
print(f"  Avg tokens in:  {current.avg_input_tokens}")
print(f"  Faithfulness:   {current.faithfulness_score:.2f}")

# Run detection
detector = FailureDetector(baseline)
alerts = detector.check_all(current)

print("\n[Failure Detection Results]")
print("─"*65)
if alerts:
    for alert in alerts:
        print(f"  {alert}")
else:
    print("  ✓ All systems nominal")

# Summary
print("\n" + "═"*65)
print("PRODUCTION OPERATIONS: COMPLETE SUMMARY")
print("═"*65)
print("""
  OBSERVABILITY STACK:
    • Phoenix - Self-hosted, RAG-focused, Apache 2.0
    • Langfuse - Open source standard, MIT, EU cloud option
    • LangSmith - Best for LangChain users
    
  EVALUATION STACK:
    • DeepEval - Pytest integration, CI/CD ready
    • RAGAS - RAG-specific metrics
    • LLM-as-Judge - Custom criteria, 80% human agreement
    
  MONITORING CADENCE:
    • Real-time: Latency, errors, costs
    • Daily: Quality scores, cost trends
    • Weekly: Evaluation pipeline review
    • Monthly: Model re-evaluation
    
  KEY METRICS:
    • Faithfulness, Relevancy, Context Precision
    • Latency (P50, P95, TTFT)
    • Token costs, error rates
    • User satisfaction (thumbs up/down)
""")


PRODUCTION: Automated Failure Detection

[Baseline Metrics]
─────────────────────────────────────────────────────────────────
  Latency P95:    400ms
  Daily cost:     €50.00
  Avg tokens in:  800
  Faithfulness:   0.85

[Current Metrics]
─────────────────────────────────────────────────────────────────
  Latency P95:    650ms
  Daily cost:     €72.00
  Avg tokens in:  1100
  Faithfulness:   0.82

[Failure Detection Results]
─────────────────────────────────────────────────────────────────
  ⚠ LATENCY REGRESSION: P95 650ms vs baseline 400ms
  ⚠ COST SPIKE: €72.00/day vs baseline €50.00/day

═════════════════════════════════════════════════════════════════
PRODUCTION OPERATIONS: COMPLETE SUMMARY
═════════════════════════════════════════════════════════════════

  OBSERVABILITY STACK:
    • Phoenix - Self-hosted, RAG-focused, Apache 2.0
    • Langfuse - Open source standard, MIT, EU cloud option
    • LangSmith - Best for LangChain users

  EVALUATION STACK:
    • DeepEval - Pytest integ
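
Only two of the four checks fired in the detection run above, even though input tokens and faithfulness both moved. The sketch below recomputes each current-to-baseline ratio and compares it with the multiplier used in `FailureDetector`. The numbers are copied from the demo; the `CHECKS` table and `ratio_report` function are hypothetical names introduced for illustration.

```python
from typing import Dict, Tuple

# metric: (baseline, current, ratio limit, direction that triggers an alert)
CHECKS: Dict[str, Tuple[float, float, float, str]] = {
    "latency_p95_ms": (400, 650, 1.5, "above"),
    "daily_cost": (50.0, 72.0, 1.3, "above"),
    "avg_input_tokens": (800, 1100, 1.4, "above"),
    "faithfulness_score": (0.85, 0.82, 0.9, "below"),
}


def ratio_report(checks: Dict[str, Tuple[float, float, float, str]]) -> Dict[str, Tuple[float, bool]]:
    """Return {metric: (current/baseline ratio, alert fired?)}."""
    report = {}
    for name, (base, cur, limit, direction) in checks.items():
        ratio = cur / base
        fired = ratio > limit if direction == "above" else ratio < limit
        report[name] = (round(ratio, 3), fired)
    return report


report = ratio_report(CHECKS)
for name, (ratio, fired) in report.items():
    limit = CHECKS[name][2]
    print(f"{name:20} ratio={ratio:.3f} limit={limit} fired={fired}")
```

Input tokens rose to 1.375 times the baseline, just under the 1.4 limit, and faithfulness fell to about 0.965 times the baseline, well above the 0.9 limit, so neither alert fired. Whether those limits are right depends on how much drift the team is willing to tolerate before someone is paged.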