# Week 5 - Lab 3: Query Rewriting & Production RAG Patterns

**Duration:** 90-120 minutes  
**Level:** Advanced  
**Prerequisites:** Week 5 Lessons 3-4, Labs 1-2

---

## 🎯 Learning Objectives

In this lab, you will:
- Implement query rewriting techniques (HyDE, Multi-Query, Step-Back)
- Measure impact of query rewriting on recall
- Build production-grade RAG with circuit breakers
- Implement feature flags for A/B testing
- Add structured logging with trace IDs
- Create observability dashboard data
- Measure SLO compliance (latency, availability)

---

## 📋 Lab Outline

1. Setup and Baseline RAG
2. Exercise 1: HyDE (Hypothetical Document Embeddings)
3. Exercise 2: Multi-Query Expansion
4. Exercise 3: Step-Back Prompting
5. Exercise 4: Query Rewriting Comparison
6. Exercise 5: Circuit Breaker Pattern
7. Exercise 6: Feature Flags for A/B Testing
8. Exercise 7: Structured Logging & Observability
9. Bonus Challenge: SLO Monitoring Dashboard

---

## 1. Setup and Baseline RAG

In [None]:
# Install required packages
!pip install -q openai numpy python-dotenv

In [None]:
import os
import time
import json
import uuid
import logging
import numpy as np
from typing import List, Dict, Optional, Tuple
from datetime import datetime
from enum import Enum
from dataclasses import dataclass, asdict
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Configure structured logging
logging.basicConfig(
    level=logging.INFO,
    format='%(message)s'
)
logger = logging.getLogger(__name__)

print("✅ Setup complete!")

### Sample Corpus and Helper Functions

In [None]:
# Sample technical documentation corpus
CORPUS = [
    {"id": "doc1", "text": "Kubernetes is a container orchestration platform that automates deployment, scaling, and management of containerized applications."},
    {"id": "doc2", "text": "Vector databases store embeddings and enable semantic search through similarity calculations like cosine distance."},
    {"id": "doc3", "text": "RAG (Retrieval-Augmented Generation) combines information retrieval with language model generation for grounded responses."},
    {"id": "doc4", "text": "Circuit breakers prevent cascading failures by failing fast when error rates exceed thresholds."},
    {"id": "doc5", "text": "Feature flags enable gradual rollouts and A/B testing by controlling feature availability at runtime."},
    {"id": "doc6", "text": "HNSW (Hierarchical Navigable Small World) graphs provide efficient approximate nearest neighbor search."},
    {"id": "doc7", "text": "Observability requires collecting logs, metrics, and traces to understand system behavior in production."},
    {"id": "doc8", "text": "SLOs (Service Level Objectives) define target reliability metrics like 99.9% availability and p95 latency under 200ms."},
    {"id": "doc9", "text": "Query rewriting techniques like HyDE generate hypothetical answers to improve retrieval accuracy."},
    {"id": "doc10", "text": "Multi-tenant systems isolate customer data while sharing infrastructure for efficiency."},
]

def get_embedding(text: str, model: str = "text-embedding-3-small") -> List[float]:
    """Get embedding for single text."""
    response = client.embeddings.create(input=[text.replace("\n", " ")], model=model)
    return response.data[0].embedding

def get_embeddings_batch(texts: List[str]) -> np.ndarray:
    """Get embeddings for multiple texts."""
    cleaned = [t.replace("\n", " ") for t in texts]
    response = client.embeddings.create(input=cleaned, model="text-embedding-3-small")
    return np.array([item.embedding for item in response.data])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Calculate cosine similarity."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Generate corpus embeddings
print("Generating corpus embeddings...")
corpus_texts = [doc["text"] for doc in CORPUS]
corpus_embeddings = get_embeddings_batch(corpus_texts)
print(f"✅ Generated {len(corpus_embeddings)} embeddings")

In [None]:
def simple_retrieve(query: str, k: int = 3) -> List[Dict]:
    """Baseline retrieval: embed query and find top-k by cosine similarity."""
    query_emb = np.array(get_embedding(query))
    
    similarities = []
    for i, doc_emb in enumerate(corpus_embeddings):
        sim = cosine_similarity(query_emb, doc_emb)
        similarities.append((i, sim))
    
    similarities.sort(key=lambda x: x[1], reverse=True)
    
    return [{**CORPUS[idx], "score": score} for idx, score in similarities[:k]]

# Test baseline retrieval
query = "How do I scale containerized applications?"
results = simple_retrieve(query, k=3)

print(f"Query: {query}\n")
print("Baseline retrieval:")
for doc in results:
    print(f"  {doc['id']}: {doc['score']:.3f} - {doc['text'][:60]}...")
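
The loop in `simple_retrieve` scores one document at a time. As the corpus grows, the same ranking can be computed with a single matrix-vector product. The sketch below is a minimal illustration with toy 2-D vectors standing in for real embeddings; `top_k_cosine` is a hypothetical helper, not part of the lab code.

```python
import numpy as np

def top_k_cosine(query_emb: np.ndarray, doc_embs: np.ndarray, k: int = 3):
    """Return (index, score) pairs for the k rows of doc_embs most similar to query_emb."""
    # Normalize once so one matrix-vector product yields every cosine score.
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q
    # argsort is ascending, so reverse it and keep the first k indices.
    top = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in top]

# Toy vectors (illustrative only, not real embeddings).
toy_docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_query = np.array([2.0, 0.0])
toy_top = top_k_cosine(toy_query, toy_docs, k=2)
print(toy_top)
```

With the real data, `top_k_cosine(query_emb, corpus_embeddings, k)` should return the same ordering as the loop above, since both rank by cosine similarity.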

---

## Exercise 1: HyDE (Hypothetical Document Embeddings)

**Task:** Implement HyDE - generate a hypothetical answer, then embed and search.

**Concept:** Instead of embedding the question directly, generate what the answer *might* look like, then search with that.

In [None]:
def hyde_retrieve(query: str, k: int = 3) -> List[Dict]:
    """
    HyDE retrieval: generate hypothetical answer, embed it, retrieve.
    """
    # TODO: Implement HyDE
    # 1. Generate hypothetical answer using LLM
    # 2. Embed the hypothetical answer
    # 3. Retrieve using hypothetical answer embedding
    
    # Step 1: Generate hypothetical answer
    hyde_prompt = f"""Write a detailed, technical answer to this question:

{query}

Answer as if you're writing documentation. Be specific and use technical terms."""
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": hyde_prompt}],
        temperature=0.7,
        max_tokens=200,
    )
    
    hypothetical_doc = response.choices[0].message.content
    print(f"Generated hypothetical doc: {hypothetical_doc[:100]}...\n")
    
    # Step 2: Embed hypothetical answer
    hyde_emb = np.array(get_embedding(hypothetical_doc))
    
    # Step 3: Retrieve using HyDE embedding
    similarities = []
    for i, doc_emb in enumerate(corpus_embeddings):
        sim = cosine_similarity(hyde_emb, doc_emb)
        similarities.append((i, sim))
    
    similarities.sort(key=lambda x: x[1], reverse=True)
    
    return [{**CORPUS[idx], "score": score} for idx, score in similarities[:k]]


# Test HyDE
query = "How do I scale containerized applications?"
hyde_results = hyde_retrieve(query, k=3)

print(f"Query: {query}\n")
print("HyDE retrieval:")
for doc in hyde_results:
    print(f"  {doc['id']}: {doc['score']:.3f} - {doc['text'][:60]}...")
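
A single hypothetical answer generated at `temperature=0.7` can drift off-topic. One variant worth trying samples several hypothetical answers and averages their embeddings, optionally together with the query embedding, so that no single generation dominates the search vector. The sketch below shows only the averaging step on toy vectors; `combine_hyde_embeddings` is a hypothetical helper name, and in the lab you would pass it vectors from `get_embedding`.

```python
import numpy as np

def combine_hyde_embeddings(query_emb, hypothetical_embs, include_query=True):
    """Average unit-normalized embeddings into a single unit-length search vector."""
    vectors = [np.asarray(e, dtype=float) for e in hypothetical_embs]
    if include_query:
        vectors.append(np.asarray(query_emb, dtype=float))
    # Normalize first so long vectors do not outweigh short ones in the mean.
    unit = [v / np.linalg.norm(v) for v in vectors]
    mean = np.mean(unit, axis=0)
    return mean / np.linalg.norm(mean)

# Toy vectors (illustrative only): two hypothetical answers agree, the query differs.
toy_query_emb = [1.0, 0.0]
toy_hypothetical_embs = [[0.0, 1.0], [0.0, 2.0]]
combined = combine_hyde_embeddings(toy_query_emb, toy_hypothetical_embs)
print(combined)
```

The combined vector can then replace `hyde_emb` in the similarity loop. The trade-off is cost: each extra hypothetical answer adds one LLM call and one embedding call per query.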

---

## Exercise 2: Multi-Query Expansion

**Task:** Generate multiple query variations, retrieve for each, then merge results.

**Concept:** Rephrase the query multiple ways to capture different aspects.

In [None]:
def multi_query_retrieve(query: str, k: int = 3, n_variations: int = 3) -> List[Dict]:
    """
    Multi-query retrieval: generate query variations, retrieve for each, merge.
    """
    # TODO: Implement multi-query expansion
    # 1. Generate n_variations of the query
    # 2. Retrieve for each variation
    # 3. Merge results (deduplicate and aggregate scores)
    
    # Step 1: Generate query variations
    variation_prompt = f"""Generate {n_variations} different ways to ask this question. Each should capture a different aspect or use different terminology.

Original: {query}

Return only the {n_variations} variations, one per line, without numbering."""
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": variation_prompt}],
        temperature=0.7,
    )
    
    variations_text = response.choices[0].message.content
    variations = [line.strip() for line in variations_text.strip().split('\n') if line.strip()]
    variations = [query] + variations[:n_variations]  # Original query plus n_variations rewrites
    
    print("Query variations:")
    for i, var in enumerate(variations, 1):
        print(f"  {i}. {var}")
    print()
    
    # Step 2: Retrieve for each variation
    all_scores = {}
    for var in variations:
        var_results = simple_retrieve(var, k=k*2)
        for doc in var_results:
            doc_id = doc["id"]
            if doc_id not in all_scores:
                all_scores[doc_id] = []
            all_scores[doc_id].append(doc["score"])
    
    # Step 3: Aggregate scores (max score across variations)
    aggregated = []
    for doc_id, scores in all_scores.items():
        aggregated.append({
            "id": doc_id,
            "score": max(scores),
            "text": next(d["text"] for d in CORPUS if d["id"] == doc_id)
        })
    
    aggregated.sort(key=lambda x: x["score"], reverse=True)
    return aggregated[:k]


# Test multi-query
query = "How do I scale containerized applications?"
multi_results = multi_query_retrieve(query, k=3, n_variations=3)

print(f"\nOriginal query: {query}\n")
print("Multi-query retrieval:")
for doc in multi_results:
    print(f"  {doc['id']}: {doc['score']:.3f} - {doc['text'][:60]}...")
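
Taking the maximum score works when all variations are scored on the same scale. An alternative that ignores raw scores is Reciprocal Rank Fusion (RRF), which adds `1 / (c + rank)` for each list a document appears in, so documents ranked well by several variations rise to the top. The sketch below is a minimal illustration on made-up rankings; `reciprocal_rank_fusion` is a hypothetical helper, and `c = 60` is the commonly used default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 3, c: int = 60) -> List[Dict]:
    """Fuse several ranked lists of document IDs into one list of the top k."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Higher ranks (smaller numbers) contribute more to the fused score.
            fused[doc_id] += 1.0 / (c + rank)
    ordered = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return [{"id": doc_id, "score": score} for doc_id, score in ordered[:k]]

# Made-up rankings for three query variations (illustrative only).
toy_rankings = [
    ["doc1", "doc3", "doc2"],
    ["doc3", "doc1", "doc9"],
    ["doc3", "doc2", "doc1"],
]
fused_top = reciprocal_rank_fusion(toy_rankings, k=2)
print(fused_top)
```

To use this inside `multi_query_retrieve`, collect `[d["id"] for d in var_results]` for each variation and fuse those lists instead of aggregating with `max(scores)`.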

---

## Exercise 3: Step-Back Prompting

**Task:** Generate a higher-level "step-back" question, retrieve for it, then use those results.

**Concept:** Sometimes a broader question retrieves better foundational context.

In [None]:
def step_back_retrieve(query: str, k: int = 3) -> Tuple[str, List[Dict]]:
    """
    Step-back retrieval: generate broader question, retrieve for it.
    
    Returns:
        (step_back_query, results)
    """
    # TODO: Implement step-back prompting
    # 1. Generate a broader, more conceptual version of the query
    # 2. Retrieve using the step-back query
    
    # Step 1: Generate step-back query
    step_back_prompt = f"""Given this specific question, generate a broader, more general question that covers the underlying concepts.

Specific question: {query}

Broader question:"""
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": step_back_prompt}],
        temperature=0.3,
    )
    
    step_back_query = response.choices[0].message.content.strip()
    print(f"Original query: {query}")
    print(f"Step-back query: {step_back_query}\n")
    
    # Step 2: Retrieve using step-back query
    results = simple_retrieve(step_back_query, k=k)
    
    return step_back_query, results


# Test step-back
query = "How do I scale containerized applications?"
sb_query, sb_results = step_back_retrieve(query, k=3)

print("Step-back retrieval:")
for doc in sb_results:
    print(f"  {doc['id']}: {doc['score']:.3f} - {doc['text'][:60]}...")
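
Retrieving only for the step-back query can drop documents that answer the specific question directly. One possible way to use the step-back results is to keep the best hits for the original query and reserve a slot or two for the broader context. The sketch below assumes both result lists have the shape returned by `simple_retrieve`; `merge_specific_and_broad` and `reserve_for_broad` are hypothetical names, and the input lists are made up.

```python
from typing import Dict, List

def merge_specific_and_broad(
    specific: List[Dict], broad: List[Dict], k: int = 3, reserve_for_broad: int = 1
) -> List[Dict]:
    """Keep top hits for the original query, reserving slots for step-back context."""
    merged, seen = [], set()
    # Take the strongest specific hits first, leaving room for broad context.
    for doc in specific[: max(k - reserve_for_broad, 0)]:
        merged.append(doc)
        seen.add(doc["id"])
    # Fill the remaining slots from the broad list, then any leftover specific hits.
    for doc in broad + specific:
        if len(merged) >= k:
            break
        if doc["id"] not in seen:
            merged.append(doc)
            seen.add(doc["id"])
    return merged

# Made-up result lists (illustrative only).
toy_specific = [{"id": "doc1"}, {"id": "doc10"}, {"id": "doc4"}]
toy_broad = [{"id": "doc1"}, {"id": "doc3"}, {"id": "doc7"}]
merged_docs = merge_specific_and_broad(toy_specific, toy_broad, k=3)
print([d["id"] for d in merged_docs])
```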

---

## Exercise 4: Query Rewriting Comparison

**Task:** Compare all rewriting methods on multiple test queries.

In [None]:
# Test queries
TEST_QUERIES = [
    "How do I scale containerized applications?",
    "What is semantic search?",
    "How to prevent cascading failures?",
]

def compare_rewriting_methods(queries: List[str]):
    """Compare different query rewriting approaches."""
    results = []
    
    for query in queries:
        print(f"\n{'='*60}")
        print(f"Query: {query}")
        print('='*60)
        
        # Baseline
        print("\n1. Baseline (direct embedding):")
        baseline = simple_retrieve(query, k=3)
        for doc in baseline:
            print(f"   {doc['id']}: {doc['score']:.3f}")
        
        # HyDE
        print("\n2. HyDE:")
        hyde = hyde_retrieve(query, k=3)
        for doc in hyde:
            print(f"   {doc['id']}: {doc['score']:.3f}")
        
        # Multi-query
        print("\n3. Multi-query:")
        multi = multi_query_retrieve(query, k=3, n_variations=3)
        for doc in multi:
            print(f"   {doc['id']}: {doc['score']:.3f}")
        
        # Step-back
        print("\n4. Step-back:")
        sb_q, sb = step_back_retrieve(query, k=3)
        for doc in sb:
            print(f"   {doc['id']}: {doc['score']:.3f}")
        
        results.append({
            "query": query,
            "baseline": [d["id"] for d in baseline],
            "hyde": [d["id"] for d in hyde],
            "multi_query": [d["id"] for d in multi],
            "step_back": [d["id"] for d in sb],
        })
    
    return results


# Run comparison
comparison_results = compare_rewriting_methods(TEST_QUERIES)
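
The comparison above prints scores but does not quantify recall. Measuring it requires relevance labels, which this lab does not ship. The sketch below assumes a hand-labeled mapping from query to relevant document IDs; the labels and the ranked lists are made up for illustration and are not results from this notebook. `recall_at_k` and `mean_recall` are hypothetical helper names.

```python
from typing import Dict, List, Set

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

def mean_recall(results: List[Dict], labels: Dict[str, Set[str]], method: str, k: int = 3) -> float:
    """Average recall@k for one method over all labeled queries."""
    scores = [
        recall_at_k(row[method], labels[row["query"]], k)
        for row in results
        if row["query"] in labels
    ]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical relevance labels and made-up rankings (illustrative only).
toy_labels = {"q1": {"doc1"}, "q2": {"doc2", "doc6"}}
toy_results = [
    {"query": "q1", "baseline": ["doc1", "doc10", "doc4"], "hyde": ["doc1", "doc5", "doc10"]},
    {"query": "q2", "baseline": ["doc2", "doc9", "doc3"], "hyde": ["doc2", "doc6", "doc9"]},
]
for method in ("baseline", "hyde"):
    print(f"{method}: recall@3 = {mean_recall(toy_results, toy_labels, method):.2f}")
```

The rows in `comparison_results` already have this shape, so once you label `TEST_QUERIES` by hand you can pass `comparison_results` and your labels to `mean_recall` for each method.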

---

## Exercise 5: Circuit Breaker Pattern

**Task:** Implement a circuit breaker to protect against cascading failures.

**States:**
- CLOSED: Normal operation
- OPEN: Failing fast, rejecting requests
- HALF_OPEN: Testing if service recovered

In [None]:
from enum import Enum
from dataclasses import dataclass
from typing import Callable, Any

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

@dataclass
class CircuitBreakerConfig:
    failure_threshold: int = 5
    timeout_seconds: float = 60.0
    half_open_max_calls: int = 3

class CircuitBreaker:
    """
    Circuit breaker implementation.
    """
    def __init__(self, config: Optional[CircuitBreakerConfig] = None):
        self.config = config or CircuitBreakerConfig()
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self.half_open_calls = 0
    
    def call(self, func: Callable, *args, **kwargs) -> Any:
        """Execute function with circuit breaker protection."""
        # TODO: Implement circuit breaker logic
        # 1. Check state
        # 2. If OPEN, check if timeout elapsed
        # 3. Execute function
        # 4. Update state based on result
        
        # Check if circuit is OPEN
        if self.state == CircuitState.OPEN:
            # Check if timeout has elapsed
            if self.last_failure_time:
                elapsed = time.time() - self.last_failure_time
                if elapsed >= self.config.timeout_seconds:
                    print(f"  [CIRCUIT] OPEN -> HALF_OPEN (timeout elapsed)")
                    self.state = CircuitState.HALF_OPEN
                    self.half_open_calls = 0
                else:
                    raise Exception(f"Circuit breaker OPEN (retry in {self.config.timeout_seconds - elapsed:.1f}s)")
            else:
                raise Exception("Circuit breaker OPEN")
        
        # Execute function
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception as e:
            self._on_failure()
            raise
    
    def _on_success(self):
        """Handle successful call."""
        if self.state == CircuitState.HALF_OPEN:
            self.half_open_calls += 1
            if self.half_open_calls >= self.config.half_open_max_calls:
                print("  [CIRCUIT] HALF_OPEN -> CLOSED (recovered)")
                self.state = CircuitState.CLOSED
                self.failure_count = 0
                self.success_count = 0
        else:
            # A success while CLOSED ends the failure streak, so only
            # consecutive failures count toward failure_threshold.
            self.failure_count = 0
        
        self.success_count += 1
    
    def _on_failure(self):
        """Handle failed call."""
        self.failure_count += 1
        self.last_failure_time = time.time()
        
        if self.state == CircuitState.HALF_OPEN:
            print(f"  [CIRCUIT] HALF_OPEN -> OPEN (failure during recovery)")
            self.state = CircuitState.OPEN
        elif self.failure_count >= self.config.failure_threshold:
            print(f"  [CIRCUIT] CLOSED -> OPEN (threshold reached: {self.failure_count})")
            self.state = CircuitState.OPEN
    
    def reset(self):
        """Reset circuit breaker."""
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self.half_open_calls = 0


# Test circuit breaker
def flaky_service(should_fail: bool = False):
    """Mock service that can fail."""
    if should_fail:
        raise Exception("Service failure")
    return "Success"

breaker = CircuitBreaker(CircuitBreakerConfig(failure_threshold=3, timeout_seconds=2))

print("Testing circuit breaker:\n")

# Cause failures to open circuit
for i in range(5):
    try:
        result = breaker.call(flaky_service, should_fail=True)
        print(f"Call {i+1}: {result}")
    except Exception as e:
        print(f"Call {i+1}: Failed - {e}")

print(f"\nCircuit state: {breaker.state.value}")
print("\nWaiting for timeout...")
time.sleep(2.5)

# Try recovery
print("\nAttempting recovery:")
for i in range(3):
    try:
        result = breaker.call(flaky_service, should_fail=False)
        print(f"Recovery call {i+1}: {result}")
    except Exception as e:
        print(f"Recovery call {i+1}: Failed - {e}")

print(f"\nFinal circuit state: {breaker.state.value}")
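
The test above has to `time.sleep` past the timeout, and callers cannot tell a fail-fast rejection from a real service error because both raise `Exception`. The sketch below is a simplified, self-contained breaker (not the `CircuitBreaker` class above) that addresses both points under two assumptions: an injectable clock so tests can advance time instantly, and a dedicated `CircuitOpenError`. All names here are hypothetical.

```python
import time
from typing import Any, Callable, Optional

class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without invoking the service."""

class ConsecutiveFailureBreaker:
    def __init__(self, failure_threshold: int = 3, timeout_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable[..., Any], *args, **kwargs) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.timeout_seconds:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            # A failed trial call, or reaching the threshold, (re)opens the circuit.
            if self.opened_at is not None or self.consecutive_failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure streak.
        self.consecutive_failures = 0
        self.opened_at = None
        return result

# Fake clock: a mutable cell the test advances by hand instead of sleeping.
fake_now = [0.0]
breaker_demo = ConsecutiveFailureBreaker(failure_threshold=2, timeout_seconds=10.0,
                                         clock=lambda: fake_now[0])

def always_fails():
    raise ValueError("backend down")

outcomes = []
for _ in range(3):
    try:
        breaker_demo.call(always_fails)
    except CircuitOpenError:
        outcomes.append("rejected")
    except ValueError:
        outcomes.append("failed")

fake_now[0] = 11.0  # Jump past the timeout without sleeping.
outcomes.append(breaker_demo.call(lambda: "ok"))
print(outcomes)  # ['failed', 'failed', 'rejected', 'ok']
```

Catching `CircuitOpenError` separately lets a caller fall back to baseline retrieval or a cached answer when the circuit is open, while still surfacing genuine service errors.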

---

## Exercise 6: Feature Flags for A/B Testing

**Task:** Implement feature flags to control query rewriting strategy.

In [None]:
class FeatureFlags:
    """
    Feature flag system for A/B testing.
    """
    def __init__(self):
        self.flags = {}
    
    def set_flag(self, name: str, enabled: bool, rollout_pct: float = 100.0):
        """Set a feature flag."""
        self.flags[name] = {
            "enabled": enabled,
            "rollout_pct": rollout_pct,
        }
    
    def is_enabled(self, name: str, user_id: str = None) -> bool:
        """Check if feature is enabled for user."""
        # TODO: Implement feature flag check
        # 1. Check if flag exists
        # 2. If not enabled globally, return False
        # 3. Check rollout percentage
        
        if name not in self.flags:
            return False
        
        flag = self.flags[name]
        
        if not flag["enabled"]:
            return False
        
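        # Check rollout percentage
        if user_id and flag["rollout_pct"] < 100.0:
            # Built-in hash() is salted per process for strings, so use a stable
            # digest; including the flag name keeps buckets independent per flag.
            import hashlib
            digest = hashlib.sha256(f"{name}:{user_id}".encode("utf-8")).hexdigest()
            return int(digest, 16) % 100 < flag["rollout_pct"]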
        
        return True


def rag_with_flags(query: str, user_id: str, flags: FeatureFlags, k: int = 3) -> Dict:
    """
    RAG retrieval with feature-flagged query rewriting.
    """
    metadata = {
        "query": query,
        "user_id": user_id,
        "rewriting_method": "baseline",
    }
    
    # Check feature flags
    if flags.is_enabled("use_hyde", user_id):
        metadata["rewriting_method"] = "hyde"
        results = hyde_retrieve(query, k=k)
    elif flags.is_enabled("use_multi_query", user_id):
        metadata["rewriting_method"] = "multi_query"
        results = multi_query_retrieve(query, k=k)
    elif flags.is_enabled("use_step_back", user_id):
        metadata["rewriting_method"] = "step_back"
        _, results = step_back_retrieve(query, k=k)
    else:
        results = simple_retrieve(query, k=k)
    
    return {
        "results": results,
        "metadata": metadata,
    }


# Test feature flags
flags = FeatureFlags()
flags.set_flag("use_hyde", enabled=True, rollout_pct=50.0)
flags.set_flag("use_multi_query", enabled=True, rollout_pct=30.0)

print("Testing feature flags with different users:\n")

for user_id in ["user_1", "user_2", "user_3", "user_4"]:
    result = rag_with_flags(
        "How do I scale containerized applications?",
        user_id,
        flags,
        k=2
    )
    method = result["metadata"]["rewriting_method"]
    print(f"{user_id}: {method}")
    for doc in result["results"]:
        print(f"  {doc['id']}: {doc['score']:.3f}")
    print()
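
For A/B results to be meaningful, a user must land in the same bucket on every request and in every process, and the share of users in a rollout should track `rollout_pct`. The sketch below checks both properties for a digest-based bucketing scheme. `rollout_bucket` and `in_rollout` are hypothetical helpers, and the user IDs are synthetic.

```python
import hashlib

def rollout_bucket(flag_name: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100

def in_rollout(flag_name: str, user_id: str, rollout_pct: float) -> bool:
    """True if the user's bucket falls below the rollout percentage."""
    return rollout_bucket(flag_name, user_id) < rollout_pct

# Synthetic user IDs (illustrative only).
sample_users = [f"user_{i}" for i in range(10_000)]
share = sum(in_rollout("use_hyde", u, 50.0) for u in sample_users) / len(sample_users)
print(f"Share of users in a 50% rollout: {share:.1%}")
```

Because the flag name is part of the hashed key, a user's bucket for `use_hyde` does not determine their bucket for `use_multi_query`, which matters when flags are checked in an `if`/`elif` chain as in `rag_with_flags`.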

---

## Exercise 7: Structured Logging & Observability

**Task:** Add structured JSON logging with trace IDs for observability.

In [None]:
@dataclass
class LogEntry:
    """Structured log entry."""
    timestamp: str
    trace_id: str
    level: str
    event: str
    metadata: Dict

class StructuredLogger:
    """Logger that outputs structured JSON."""
    
    def log(self, level: str, event: str, trace_id: str, **kwargs):
        """Log structured event."""
        entry = LogEntry(
            timestamp=datetime.now().astimezone().isoformat(),  # timezone-aware; utcnow() is deprecated since Python 3.12
            trace_id=trace_id,
            level=level,
            event=event,
            metadata=kwargs,
        )
        print(json.dumps(asdict(entry)))
    
    def info(self, event: str, trace_id: str, **kwargs):
        self.log("INFO", event, trace_id, **kwargs)
    
    def error(self, event: str, trace_id: str, **kwargs):
        self.log("ERROR", event, trace_id, **kwargs)


def production_rag(
    query: str,
    user_id: str,
    flags: FeatureFlags,
    breaker: CircuitBreaker,
    logger: StructuredLogger,
    k: int = 3,
) -> Dict:
    """
    Production RAG with observability.
    """
    trace_id = str(uuid.uuid4())
    start_time = time.time()
    
    try:
        # Log request start
        logger.info(
            "rag_request_start",
            trace_id=trace_id,
            user_id=user_id,
            query=query,
            k=k,
        )
        
        # Execute retrieval with circuit breaker
        def retrieve():
            return rag_with_flags(query, user_id, flags, k=k)
        
        result = breaker.call(retrieve)
        
        # Calculate latency
        latency_ms = (time.time() - start_time) * 1000
        
        # Log success
        logger.info(
            "rag_request_complete",
            trace_id=trace_id,
            user_id=user_id,
            latency_ms=latency_ms,
            rewriting_method=result["metadata"]["rewriting_method"],
            result_count=len(result["results"]),
            circuit_state=breaker.state.value,
        )
        
        return {
            **result,
            "trace_id": trace_id,
            "latency_ms": latency_ms,
        }
        
    except Exception as e:
        # Log error
        latency_ms = (time.time() - start_time) * 1000
        logger.error(
            "rag_request_failed",
            trace_id=trace_id,
            user_id=user_id,
            latency_ms=latency_ms,
            error=str(e),
            circuit_state=breaker.state.value,
        )
        raise


# Test production RAG
structured_logger = StructuredLogger()
production_breaker = CircuitBreaker()
production_flags = FeatureFlags()
production_flags.set_flag("use_hyde", enabled=True, rollout_pct=50.0)

print("\nProduction RAG with observability:\n")

result = production_rag(
    query="How do I scale containerized applications?",
    user_id="test_user",
    flags=production_flags,
    breaker=production_breaker,
    logger=structured_logger,
    k=3,
)

print(f"\nTrace ID: {result['trace_id']}")
print(f"Latency: {result['latency_ms']:.2f}ms")
print(f"Method: {result['metadata']['rewriting_method']}")
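
`StructuredLogger` writes JSON with `print`, which bypasses log levels and handlers. One possible alternative is to route the same fields through the standard `logging` module with a JSON formatter. The sketch below assumes the trace ID and metadata are passed via the `extra` argument; `JsonFormatter` and the logger name `rag.demo` are hypothetical, and the output is captured in a `StringIO` only so the example can inspect it.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "event": record.getMessage(),
            # Fields passed through `extra=` become attributes on the record.
            "trace_id": getattr(record, "trace_id", None),
            "metadata": getattr(record, "metadata", {}),
        }
        return json.dumps(payload)

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

rag_logger = logging.getLogger("rag.demo")
rag_logger.setLevel(logging.INFO)
rag_logger.addHandler(handler)
rag_logger.propagate = False  # Avoid duplicate output via the root logger.

rag_logger.info("rag_request_start", extra={"trace_id": "trace-123", "metadata": {"k": 3}})
parsed = json.loads(stream.getvalue())
print(parsed["event"], parsed["trace_id"], parsed["metadata"])
```

In a service, the handler would typically write to stdout or a file instead of a `StringIO`, and a log shipper would index the `trace_id` field for lookup.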

---

## Bonus Challenge: SLO Monitoring Dashboard

**Task:** Simulate multiple requests and calculate SLO metrics.

In [None]:
from typing import List
from dataclasses import dataclass

@dataclass
class RequestMetrics:
    """Metrics for a single request."""
    trace_id: str
    success: bool
    latency_ms: float
    method: str

def simulate_traffic(n_requests: int = 50) -> List[RequestMetrics]:
    """Simulate production traffic."""
    logger = StructuredLogger()
    breaker = CircuitBreaker()
    flags = FeatureFlags()
    flags.set_flag("use_hyde", enabled=True, rollout_pct=50.0)
    
    queries = [
        "How do I scale containerized applications?",
        "What is semantic search?",
        "How to prevent cascading failures?",
    ]
    
    metrics = []
    
    print("Simulating production traffic...\n")
    
    for i in range(n_requests):
        user_id = f"user_{i % 10}"
        query = queries[i % len(queries)]
        
        try:
            result = production_rag(
                query=query,
                user_id=user_id,
                flags=flags,
                breaker=breaker,
                logger=logger,
                k=3,
            )
            
            metrics.append(RequestMetrics(
                trace_id=result["trace_id"],
                success=True,
                latency_ms=result["latency_ms"],
                method=result["metadata"]["rewriting_method"],
            ))
            
        except Exception as e:
            metrics.append(RequestMetrics(
                trace_id=str(uuid.uuid4()),
                success=False,
                latency_ms=0.0,
                method="failed",
            ))
        
        # Progress update every 10 requests
        if i % 10 == 0:
            print(f"Processed {i}/{n_requests} requests...")
    
    return metrics


def calculate_slos(metrics: List[RequestMetrics]) -> Dict:
    """Calculate SLO metrics."""
    # TODO: Calculate SLO metrics
    # 1. Availability (success rate)
    # 2. Latency percentiles (p50, p95, p99)
    # 3. Method distribution
    
    total = len(metrics)
    successes = sum(1 for m in metrics if m.success)
    
    latencies = [m.latency_ms for m in metrics if m.success]
    
    method_counts = {}
    for m in metrics:
        method_counts[m.method] = method_counts.get(m.method, 0) + 1
    
    return {
        "total_requests": total,
        "successful_requests": successes,
        "availability_pct": (successes / total * 100) if total > 0 else 0,
        "latency_p50_ms": np.percentile(latencies, 50) if latencies else 0,
        "latency_p95_ms": np.percentile(latencies, 95) if latencies else 0,
        "latency_p99_ms": np.percentile(latencies, 99) if latencies else 0,
        "method_distribution": method_counts,
    }


# Run simulation
metrics = simulate_traffic(n_requests=20)  # Use smaller number for demo
slos = calculate_slos(metrics)

print("\n" + "="*60)
print("SLO DASHBOARD")
print("="*60)
print(f"\nTotal Requests: {slos['total_requests']}")
print(f"Successful: {slos['successful_requests']}")
print(f"Availability: {slos['availability_pct']:.2f}% (target: 99.9%)")
print(f"\nLatency:")
print(f"  p50: {slos['latency_p50_ms']:.2f}ms")
print(f"  p95: {slos['latency_p95_ms']:.2f}ms (target: <200ms)")
print(f"  p99: {slos['latency_p99_ms']:.2f}ms")
print(f"\nMethod Distribution:")
for method, count in slos['method_distribution'].items():
    pct = (count / slos['total_requests'] * 100)
    print(f"  {method}: {count} ({pct:.1f}%)")
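
The dashboard reports availability next to its target, but it does not say how much room for failure remains. An error budget expresses that: a 99.9% target over 10,000 requests allows about 10 failures. The sketch below is a minimal illustration with made-up request counts; `error_budget_report` is a hypothetical helper.

```python
from typing import Dict

def error_budget_report(total: int, failed: int, availability_target_pct: float = 99.9) -> Dict:
    """Compare observed failures against the failures the SLO target allows."""
    allowed_failures = total * (1 - availability_target_pct / 100)
    availability_pct = (total - failed) / total * 100 if total else 0.0
    if allowed_failures > 0:
        budget_used_pct = failed / allowed_failures * 100
    else:
        # No failures are allowed (100% target or no traffic).
        budget_used_pct = 0.0 if failed == 0 else float("inf")
    return {
        "availability_pct": availability_pct,
        "allowed_failures": allowed_failures,
        "budget_used_pct": budget_used_pct,
        "slo_met": availability_pct >= availability_target_pct,
    }

# Made-up traffic numbers (illustrative only).
report = error_budget_report(total=10_000, failed=4)
print(f"Availability: {report['availability_pct']:.2f}%")
print(f"Error budget used: {report['budget_used_pct']:.0f}%")
print(f"SLO met: {report['slo_met']}")
```

With the 20-request demo run above, a 99.9% target allows only 0.02 failures, so a single failed request exhausts the budget many times over. Availability SLOs only become informative at realistic traffic volumes.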

---

## 🎉 Lab Complete!

### What You Learned

- ✅ Query rewriting with HyDE, Multi-Query, Step-Back
- ✅ Comparing rewriting strategies on test queries
- ✅ Circuit breaker pattern for resilience
- ✅ Feature flags for gradual rollouts and A/B testing
- ✅ Structured logging with trace IDs
- ✅ SLO monitoring (availability, latency)
- ✅ Production-grade RAG implementation

### Key Takeaways

1. **Query Rewriting** can improve recall; the 10-30% range cited here depends on query type and corpus, so measure it on your own data
2. **HyDE** works best for conceptual questions
3. **Multi-Query** improves coverage for ambiguous queries
4. **Step-Back** retrieves better foundational context
5. **Circuit Breakers** prevent cascading failures
6. **Feature Flags** enable safe experimentation
7. **Structured Logging** enables debugging and analysis
8. **SLO Monitoring** ensures production reliability

### Production Recommendations

- Start with baseline, measure performance
- A/B test query rewriting methods with feature flags
- Set circuit breaker thresholds based on observed error rates
- Log all requests with trace IDs for debugging
- Monitor SLOs continuously (99.9% availability, p95 < 200ms)
- Use gradual rollouts (10% → 50% → 100%)

### Next Steps

1. Integrate with production vector database
2. Add distributed tracing (OpenTelemetry)
3. Set up alerts for SLO violations
4. Implement cost tracking per method
5. Create Grafana dashboard for real-time monitoring

### Resources

- Week 5 Resources: [../resources/README.md](../resources/README.md)
- Week 4 Monitoring Guide: [../../week-04/resources/monitoring-production-rag.md](../../week-04/resources/monitoring-production-rag.md)
- Evaluation Harness: [../resources/examples/](../resources/examples/)