# Demo #1: Query Enhancement with HyDE (Hypothetical Document Embeddings)

## 🎯 Workshop Objectives

In this notebook, we'll explore **Hypothetical Document Embeddings (HyDE)**, an advanced RAG technique that bridges the semantic gap between user queries and documents. We'll learn how to:

1. Understand the **query-to-document asymmetry problem**
2. Implement a **baseline Naive RAG** system
3. Build an **advanced HyDE-enhanced RAG** pipeline
4. Compare performance and analyze when HyDE excels
5. Explore advanced HyDE variations

## 📚 Core Concepts

### The Query-Document Asymmetry Problem

Traditional RAG systems face a fundamental challenge:
- **User queries** are typically short, keyword-focused, and question-like
- **Documents** are long, verbose, and declarative

When we embed both into the same vector space, the semantic mismatch can lead to suboptimal retrieval results.
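
To make the mismatch concrete, here is a minimal sketch that uses word counts as a stand-in for real embeddings. The `bow_vector` and `cosine` helpers and the three example texts are illustrative assumptions, not part of the demo pipeline; a real embedding model is far less brittle than exact word overlap, but the same tendency shows up in a milder form.

```python
import math
import re
from collections import Counter


def bow_vector(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


document = (
    "Smaller chunks provide precise retrieval but may lack context. "
    "Larger chunks preserve context but create noisy embeddings."
)
short_query = "chunk size tradeoff?"
hypothetical_answer = (
    "Smaller chunks give precise retrieval but lose context, while larger "
    "chunks preserve context but produce noisy embeddings."
)

# The short query shares no exact words with the document; the answer-like text does.
query_sim = cosine(bow_vector(short_query), bow_vector(document))
answer_sim = cosine(bow_vector(hypothetical_answer), bow_vector(document))
print(f"query vs. document:  {query_sim:.3f}")
print(f"answer vs. document: {answer_sim:.3f}")
```

The document-like text scores higher simply because it is written the way the document is written, which is the effect HyDE tries to exploit.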

### HyDE Solution

HyDE takes a clever approach:
1. Instead of embedding the query directly, we ask an LLM to generate a **hypothetical ideal answer**
2. We embed this rich, document-like answer
3. We search using this embedding (answer-to-answer similarity vs. query-to-document)

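This transforms the search from "query→document" to "answer→answer": the hypothetical answer resembles the stored documents in length and style, so its embedding tends to land closer to the relevant passages than the raw query's embedding does.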
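
The three steps can be sketched independently of any particular LLM or vector store. In the sketch below, `hyde_search`, `toy_embed`, and `canned_generate` are hypothetical names: the generator returns a fixed string instead of calling a model, and the embedder just counts a handful of vocabulary terms, so treat it as an outline of the control flow rather than a working retriever.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_search(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Vector],
    corpus: List[str],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank corpus texts against an embedded hypothetical answer."""
    hypothetical = generate(query)        # Step 1: write a hypothetical answer
    search_vector = embed(hypothetical)   # Step 2: embed that answer
    scored = [(doc, cosine(search_vector, embed(doc))) for doc in corpus]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]  # Step 3


# --- Stand-ins for a real LLM and embedding model (assumptions for the demo) ---
VOCAB = ["chunk", "context", "embedding", "graph", "rerank"]


def toy_embed(text: str) -> List[float]:
    """Count occurrences of a few vocabulary terms."""
    lowered = text.lower()
    return [float(lowered.count(term)) for term in VOCAB]


def canned_generate(query: str) -> str:
    """Pretend LLM: always returns the same passage."""
    return "Small chunks are precise but lose context; large chunks keep context."


corpus = [
    "Chunking splits documents; chunk size trades precision against context.",
    "Knowledge graph retrieval follows edges between graph nodes.",
]
top = hyde_search("Why does size matter?", canned_generate, toy_embed, corpus, k=1)
print(top[0][0])
```

Note that the raw query "Why does size matter?" contains none of the vocabulary terms, so embedding it directly would give this toy retriever nothing to work with.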

---

## 1. Environment Setup and Dependencies

First, let's install and import all required libraries for this demo.

In [None]:
# Install required packages (uncomment if running for the first time)
# !pip install langchain langchain-openai langchain-community chromadb sentence-transformers openai tiktoken

import os
import warnings
from typing import List, Dict, Tuple
import numpy as np
from collections import defaultdict

# LangChain imports
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.documents import Document

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

print("✅ All libraries imported successfully!")

### Configure API Keys and LLM

**Important:** Set your OpenAI API key as an environment variable or directly in the code (for demo purposes only).
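
If you would rather not paste the key into a cell at all, one option is to prompt for it only when it is missing. This is a small sketch; `ensure_api_key` is a hypothetical helper, and the `env` and `prompt` parameters exist only so the function can be exercised without a real terminal prompt.

```python
import os
from getpass import getpass
from typing import Callable, MutableMapping


def ensure_api_key(
    env: MutableMapping[str, str] = os.environ,
    prompt: Callable[[str], str] = getpass,
    name: str = "OPENAI_API_KEY",
) -> bool:
    """Return True if the key was already set, False if it had to be prompted for."""
    if env.get(name):
        return True
    env[name] = prompt(f"Enter {name}: ")
    return False


# Demonstration with a fake environment and a fake prompt (no real key involved)
fake_env = {}
was_set = ensure_api_key(env=fake_env, prompt=lambda _: "sk-demo-not-a-real-key")
print("already set" if was_set else "prompted for key")
```

In the notebook you would call `ensure_api_key()` with no arguments, which reads `os.environ` and uses `getpass` so the key is not echoed or stored in the cell.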

In [None]:
# Set your OpenAI API key
# Option 1: Set it in code (demo only; prefer exporting it in your shell)
# os.environ["OPENAI_API_KEY"] = "your-api-key-here"

# Option 2: Load from a .env file
# from dotenv import load_dotenv
# load_dotenv()

# Verify API key is set
if "OPENAI_API_KEY" not in os.environ:
    print("⚠️  WARNING: OPENAI_API_KEY not found in environment variables!")
    print("Please set it before proceeding.")
else:
    print("✅ API key configured successfully!")

# Initialize the LLM for query generation and answer generation
llm = ChatOpenAI(
    model="gpt-3.5-turbo",
    temperature=0.7,
    max_tokens=500
)

print(f"✅ LLM initialized: {llm.model_name}")

## 2. Data Ingestion and Knowledge Base Creation

Let's create a sample knowledge base about advanced RAG techniques. In a real scenario, you would load this from PDFs, web scraping, or databases.

In [None]:
# Sample knowledge base: Technical documentation about RAG systems
sample_documents = [
    """
    Retrieval-Augmented Generation (RAG) is an AI framework that combines the power of large language models 
    with external knowledge retrieval. The core principle is to retrieve relevant information from a knowledge 
    base before generating a response. This approach significantly reduces hallucinations and provides more 
    accurate, up-to-date information. RAG systems typically consist of three main components: a retriever 
    that finds relevant documents, an embedder that converts text to vectors, and a generator that produces 
    the final answer based on retrieved context.
    """,
    """
    Vector embeddings are numerical representations of text that capture semantic meaning. In RAG systems, 
    both documents and queries are converted into high-dimensional vectors using embedding models like 
    sentence-transformers or OpenAI's text-embedding-ada-002. The similarity between vectors is typically measured using 
    cosine similarity or Euclidean distance. Embeddings enable semantic search, where documents can be 
    retrieved based on meaning rather than exact keyword matches. The quality of embeddings directly 
    impacts retrieval performance.
    """,
    """
    Chunking strategies are critical for RAG performance. Documents must be split into smaller segments 
    that balance context and precision. Fixed-size chunking divides text into equal-length segments, 
    while recursive character splitting respects natural boundaries like paragraphs and sentences. 
    Smaller chunks (200-500 tokens) provide precise retrieval but may lack context. Larger chunks 
    (1000-2000 tokens) preserve context but create noisy embeddings. The optimal chunk size depends 
    on the document structure and query complexity.
    """,
    """
    Hypothetical Document Embeddings (HyDE) is an advanced technique that addresses the asymmetry between 
    queries and documents. Instead of embedding the user's query directly, HyDE first uses an LLM to 
    generate a hypothetical ideal answer to the query. This generated answer is then embedded and used 
    for retrieval. Since the hypothetical answer is document-like (verbose, declarative), it better 
    matches the embedding space of the actual documents, leading to improved retrieval accuracy.
    """,
    """
    Cross-encoder re-rankers are models that take both a query and a document as input and output a 
    relevance score. Unlike bi-encoders which create separate embeddings, cross-encoders perform joint 
    encoding with attention across both texts. This makes them more accurate but slower than bi-encoders. 
    In a two-stage retrieval pipeline, fast bi-encoders retrieve a broad set of candidates (high recall), 
    and slower cross-encoders re-rank them for precision. Popular models include bge-reranker-large and 
    Cohere's rerank endpoint.
    """,
    """
    Context window optimization is essential for effective RAG. LLMs have token limits (e.g., 4k, 8k, 
    128k tokens) that constrain how much context can be provided. The 'lost in the middle' problem 
    shows that LLMs pay more attention to information at the beginning and end of the context. 
    Strategies include strategic reordering (placing most relevant docs at start/end), extractive 
    compression (filtering sentences by relevance), and abstractive summarization. Token counting 
    and dynamic truncation ensure context fits within limits.
    """,
    """
    Hybrid search combines dense vector search with sparse keyword search (BM25). Dense vectors capture 
    semantic meaning and handle synonyms well, but struggle with exact matches like acronyms or proper 
    names. BM25 excels at keyword matching but lacks semantic understanding. Hybrid search runs both 
    methods in parallel and fuses results using weighted fusion (combining scores) or Reciprocal Rank 
    Fusion (combining ranks). This approach leverages complementary strengths, improving retrieval 
    robustness across diverse query types.
    """,
    """
    Graph-based retrieval (GraphRAG) models knowledge as nodes and relationships rather than text chunks. 
    Entities are extracted from documents and connected via relationships to form a knowledge graph. 
    At query time, the system can traverse the graph to perform multi-hop reasoning. For example, to 
    answer 'Which movies directed by Christopher Nolan starred Michael Caine?', the system finds Nolan's 
    node, follows 'DIRECTED' edges to movies, then checks for 'STARRED_IN' edges to Caine. GraphRAG 
    enables complex relational queries that are difficult for vector search alone.
    """
]

# Create Document objects
documents = [Document(page_content=doc.strip(), metadata={"source": f"doc_{i}"}) 
             for i, doc in enumerate(sample_documents)]

print(f"✅ Created {len(documents)} sample documents")
print(f"📊 Average document length: {sum(len(d.page_content) for d in documents) / len(documents):.0f} characters")

### Chunk Documents

We'll use recursive character splitting to create semantically coherent chunks.
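
To build intuition for what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified sliding-window chunker. It is not LangChain's recursive algorithm (it ignores separators entirely), and `sliding_window_chunks` is a hypothetical helper; it only shows how overlap makes neighboring chunks share a boundary region.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one. A recursive splitter follows the same size and overlap budget but prefers to cut at paragraph and sentence boundaries instead of at arbitrary character offsets.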

In [None]:
# Initialize text splitter with semantic-aware parameters
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,  # Measured in characters (length_function=len), not tokens
    chunk_overlap=50,  # Overlap to maintain continuity
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]  # Prioritize paragraph/sentence boundaries
)

# Split documents into chunks
chunks = text_splitter.split_documents(documents)

print(f"✅ Split {len(documents)} documents into {len(chunks)} chunks")
print(f"📊 Average chunk length: {sum(len(c.page_content) for c in chunks) / len(chunks):.0f} characters")
print(f"\n📄 Sample chunk:")
print(f"{'='*70}")
print(chunks[0].page_content[:300] + "...")
print(f"{'='*70}")

### Create Embeddings and Vector Store

We'll use HuggingFace's sentence-transformers as our bi-encoder for creating dense vector embeddings.
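
Because the embeddings are normalized to unit length, cosine similarity and squared Euclidean distance carry the same ranking information: for unit vectors, the squared distance equals 2 − 2·cos. The sketch below checks that identity on two small vectors. Whether a given vector store reports squared L2, plain L2, or cosine distance depends on its configuration, so treat the conversion as an assumption to verify against your store's documentation.

```python
import math
from typing import List


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

cos_sim = dot(a, b)        # for unit vectors, the dot product is the cosine
dist_sq = squared_l2(a, b)

print(f"cosine similarity:   {cos_sim:.4f}")
print(f"squared L2 distance: {dist_sq:.4f}")
print(f"2 - 2*cos:           {2 - 2 * cos_sim:.4f}")
```

The practical consequence is that a lower distance corresponds to a higher cosine similarity, so results sorted by ascending distance are already sorted by descending similarity.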

In [None]:
# Initialize embedding model (using a lightweight, high-quality model)
embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={'device': 'cpu'},  # Use 'cuda' if GPU available
    encode_kwargs={'normalize_embeddings': True}  # Normalize for cosine similarity
)

print("✅ Embedding model loaded: all-MiniLM-L6-v2")
print("📊 Embedding dimension: 384")

# Create vector store with ChromaDB
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    collection_name="rag_knowledge_base",
    persist_directory=None  # In-memory for this demo
)

print(f"✅ Vector store created with {len(chunks)} document chunks")
print(f"💾 Vector store type: {type(vectorstore).__name__}")

## 3. Baseline Naive RAG Implementation

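Now let's implement a traditional RAG system that embeds the query directly and retrieves the nearest chunks. Note that Chroma's `similarity_search_with_score` returns a distance rather than a similarity, so lower scores indicate closer matches.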

In [None]:
def naive_rag_retrieval(query: str, k: int = 3) -> List[Tuple[Document, float]]:
    """
    Perform traditional RAG retrieval by directly embedding the user query.
    
    Args:
        query: User's question
        k: Number of top documents to retrieve
        
    Returns:
        List of (Document, distance) tuples; lower distance means a closer match
    """
    # Directly retrieve using the query embedding
    results = vectorstore.similarity_search_with_score(query, k=k)
    return results


def generate_answer(query: str, context_docs: List[Document]) -> str:
    """
    Generate an answer using the LLM with retrieved context.
    
    Args:
        query: User's question
        context_docs: Retrieved relevant documents
        
    Returns:
        Generated answer string
    """
    # Prepare context from retrieved documents
    context = "\n\n".join([f"[Doc {i+1}]: {doc.page_content}" 
                           for i, doc in enumerate(context_docs)])
    
    # Create prompt template
    prompt_template = ChatPromptTemplate.from_messages([
        ("system", "You are a helpful AI assistant. Use the provided context to answer the user's question accurately and concisely."),
        ("human", """Context:
{context}

Question: {question}

Answer:""")
    ])
    
    # Generate response
    messages = prompt_template.format_messages(context=context, question=query)
    response = llm.invoke(messages)
    
    return response.content


# Test the baseline RAG
print("="*80)
print("🔍 BASELINE NAIVE RAG TEST")
print("="*80)

test_query = "What are the main challenges with chunking strategies in RAG systems?"
print(f"\n📝 Query: {test_query}\n")

# Retrieve relevant documents
retrieved_docs = naive_rag_retrieval(test_query, k=3)

print("📚 Retrieved Documents:")
for i, (doc, score) in enumerate(retrieved_docs):
    print(f"\n[{i+1}] Distance (lower is better): {score:.4f}")
    print(f"Content: {doc.page_content[:150]}...")

# Generate answer
answer = generate_answer(test_query, [doc for doc, _ in retrieved_docs])
print(f"\n💡 Generated Answer:")
print(f"{answer}")
print("="*80)

## 4. HyDE Enhancement Implementation

Now let's implement the HyDE approach: generate a hypothetical answer first, then use it for retrieval.
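
The HyDE paper also describes a variant that samples several hypothetical documents and averages their embeddings together with the query embedding, which dampens the effect of any single off-target generation. The sketch below shows only the averaging step; `mean_vector`, `hyde_search_vector`, and `toy_embed` are hypothetical names, and the toy embedder is a placeholder for a real model.

```python
from typing import Callable, List


def mean_vector(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def hyde_search_vector(
    query: str,
    hypotheticals: List[str],
    embed: Callable[[str], List[float]],
) -> List[float]:
    """Average the query embedding with the embeddings of the hypothetical docs."""
    return mean_vector([embed(query)] + [embed(h) for h in hypotheticals])


def toy_embed(text: str) -> List[float]:
    """Placeholder embedder: [length, number of 'a' characters]."""
    return [float(len(text)), float(text.count("a"))]


search_vector = hyde_search_vector("aa", ["a", "aaa"], toy_embed)
print(search_vector)
```

In this notebook, the embedder would be something like `embedding_model.embed_query`, and the resulting vector could be passed to a by-vector search such as `vectorstore.similarity_search_by_vector`, assuming your LangChain version exposes it. The averaged vector is generally not unit length, so re-normalize it if your distance metric assumes unit vectors.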

In [None]:
def generate_hypothetical_document(query: str) -> str:
    """
    Generate a hypothetical ideal answer to the query using an LLM.
    This answer will be embedded and used for retrieval.
    
    Args:
        query: User's question
        
    Returns:
        Hypothetical document (ideal answer) as a string
    """
    # Create HyDE prompt template
    hyde_prompt = ChatPromptTemplate.from_messages([
        ("system", """You are an expert technical writer. Given a question, write a comprehensive, 
detailed passage that would contain the perfect answer to that question. 
Write as if you are authoring a technical document or research paper.
Be specific, detailed, and authoritative."""),
        ("human", "Question: {query}\n\nPassage:")
    ])
    
    # Generate hypothetical document
    messages = hyde_prompt.format_messages(query=query)
    response = llm.invoke(messages)
    
    return response.content


def hyde_retrieval(query: str, k: int = 3) -> Tuple[str, List[Tuple[Document, float]]]:
    """
    Perform HyDE-enhanced retrieval:
    1. Generate hypothetical document
    2. Embed the hypothetical document
    3. Retrieve using the hypothetical document embedding
    
    Args:
        query: User's question
        k: Number of top documents to retrieve
        
    Returns:
        Tuple of (hypothetical_doc, retrieved_results)
    """
    # Step 1: Generate hypothetical ideal answer
    hypothetical_doc = generate_hypothetical_document(query)
    
    # Step 2 & 3: Use hypothetical document for retrieval
    # The vectorstore will automatically embed the hypothetical doc
    results = vectorstore.similarity_search_with_score(hypothetical_doc, k=k)
    
    return hypothetical_doc, results


# Test HyDE retrieval
print("="*80)
print("🚀 HyDE-ENHANCED RAG TEST")
print("="*80)

test_query = "What are the main challenges with chunking strategies in RAG systems?"
print(f"\n📝 Query: {test_query}\n")

# Generate hypothetical document and retrieve
hypothetical_doc, retrieved_docs_hyde = hyde_retrieval(test_query, k=3)

print("🤖 Hypothetical Document Generated:")
print(f"{'-'*70}")
print(f"{hypothetical_doc}")
print(f"{'-'*70}\n")

print("📚 Retrieved Documents (using HyDE):")
for i, (doc, score) in enumerate(retrieved_docs_hyde):
    print(f"\n[{i+1}] Distance (lower is better): {score:.4f}")
    print(f"Content: {doc.page_content[:150]}...")

# Generate final answer
answer_hyde = generate_answer(test_query, [doc for doc, _ in retrieved_docs_hyde])
print(f"\n💡 Generated Answer (HyDE):")
print(f"{answer_hyde}")
print("="*80)

## 5. Comparative Evaluation

Let's systematically compare Baseline RAG vs. HyDE across multiple test queries.
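
Retrieval scores alone say how close the retrieved chunks are to the search vector, not whether they are the right chunks. If you can label which source documents are relevant to each test query, rank-based metrics give a more direct comparison. The sketch below implements recall@k and reciprocal rank; the two rankings and the relevance label are made-up inputs for illustrating the functions, not measured results from this notebook.

```python
from typing import List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical labels and rankings, using ids in the style of metadata["source"]
relevant = {"doc_5"}
baseline_ranking = ["doc_1", "doc_5", "doc_3"]
hyde_ranking = ["doc_5", "doc_1", "doc_6"]

for name, ranking in [("baseline", baseline_ranking), ("hyde", hyde_ranking)]:
    print(
        f"{name:<9} recall@1={recall_at_k(ranking, relevant, 1):.2f} "
        f"RR={reciprocal_rank(ranking, relevant):.2f}"
    )
```

Averaging the reciprocal rank over all test queries gives Mean Reciprocal Rank (MRR), which is a common single-number summary for this kind of comparison.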

In [None]:
# Define diverse test queries
test_queries = [
    "How does semantic search differ from keyword search?",
    "What is the lost in the middle problem?",
    "Explain the architecture of hybrid search systems",
    "Why are cross-encoder models slower but more accurate?",
    "What role do knowledge graphs play in RAG?"
]

def evaluate_retrieval_quality(results: List[Tuple[Document, float]]) -> Dict[str, float]:
    """Summarize the distance scores of a result list (lower distance = closer match)."""
    scores = [score for _, score in results]
    return {
        "avg_distance": float(np.mean(scores)),
        "min_distance": float(np.min(scores)),  # Best match in the top-k
        "max_distance": float(np.max(scores)),  # Worst match in the top-k
        "distance_variance": float(np.var(scores))
    }


# Run comparative evaluation
print("="*80)
print("📊 COMPARATIVE EVALUATION: BASELINE vs. HyDE")
print("="*80)

comparison_results = []

for i, query in enumerate(test_queries, 1):
    print(f"\n{'='*80}")
    print(f"Query {i}: {query}")
    print(f"{'='*80}")
    
    # Baseline retrieval
    baseline_results = naive_rag_retrieval(query, k=3)
    baseline_metrics = evaluate_retrieval_quality(baseline_results)
    
    # HyDE retrieval
    hyde_doc, hyde_results = hyde_retrieval(query, k=3)
    hyde_metrics = evaluate_retrieval_quality(hyde_results)
    
    # Compare
    print(f"\n📈 Retrieval Distance Metrics (lower distance is better):")
    print(f"{'Metric':<20} {'Baseline':>12} {'HyDE':>12} {'Change':>12}")
    print(f"{'-'*60}")
    
    for metric in baseline_metrics:
        baseline_val = baseline_metrics[metric]
        hyde_val = hyde_metrics[metric]
        # Relative change; a negative value means HyDE produced a smaller number
        change = ((hyde_val - baseline_val) / baseline_val * 100) if baseline_val != 0 else 0
        
        print(f"{metric:<20} {baseline_val:>12.4f} {hyde_val:>12.4f} {change:>11.1f}%")
    
    # Store for summary
    comparison_results.append({
        "query": query,
        "baseline": baseline_metrics,
        "hyde": hyde_metrics
    })

print(f"\n{'='*80}")
print("📊 SUMMARY STATISTICS")
print(f"{'='*80}")

# Calculate the overall change; a positive reduction means HyDE's chunks were closer
avg_baseline_dist = np.mean([r["baseline"]["avg_distance"] for r in comparison_results])
avg_hyde_dist = np.mean([r["hyde"]["avg_distance"] for r in comparison_results])
distance_reduction = ((avg_baseline_dist - avg_hyde_dist) / avg_baseline_dist * 100)

print(f"\nAverage Distance (lower is better):")
print(f"  Baseline: {avg_baseline_dist:.4f}")
print(f"  HyDE:     {avg_hyde_dist:.4f}")
print(f"  Reduction: {distance_reduction:+.1f}%")

print(f"\n✨ HyDE retrieved {'closer' if distance_reduction > 0 else 'more distant'} chunks on average.")
print("Note: distance to the search vector is only a proxy for relevance.")

### When Does HyDE Excel?

Let's look at query styles where HyDE is expected to help, and check whether it actually retrieves closer matches than the baseline.

In [None]:
print("="*80)
print("🎯 ANALYZING HYDE ADVANTAGES")
print("="*80)

# Test cases where HyDE should excel
edge_cases = [
    {
        "query": "Why?",  # Very short, ambiguous query
        "context": "Short, vague queries"
    },
    {
        "query": "cross encoder vs bi encoder performance",  # Keyword-heavy
        "context": "Keyword-style queries"
    },
    {
        "query": "What's the problem with putting important info in the middle of long contexts?",  # Conversational
        "context": "Conversational, verbose queries"
    }
]

for case in edge_cases:
    query = case["query"]
    print(f"\n{'='*80}")
    print(f"📝 Query: '{query}'")
    print(f"📍 Context: {case['context']}")
    print(f"{'='*80}")
    
    # Baseline
    baseline_results = naive_rag_retrieval(query, k=2)
    print(f"\n🔵 Baseline Top Result:")
    print(f"  Distance: {baseline_results[0][1]:.4f}")
    print(f"  Content: {baseline_results[0][0].page_content[:100]}...")
    
    # HyDE
    hyde_doc, hyde_results = hyde_retrieval(query, k=2)
    print(f"\n🟢 HyDE Top Result:")
    print(f"  Distance: {hyde_results[0][1]:.4f}")
    print(f"  Content: {hyde_results[0][0].page_content[:100]}...")
    
    # Scores are distances, so a positive reduction means HyDE's top hit is closer
    baseline_dist = baseline_results[0][1]
    hyde_dist = hyde_results[0][1]
    reduction = ((baseline_dist - hyde_dist) / baseline_dist * 100) if baseline_dist else 0.0
    print(f"\n📊 Distance reduction: {reduction:+.1f}%")

print("\n" + "="*80)
print("💡 KEY INSIGHTS")
print("="*80)
print("""
HyDE typically excels in these scenarios:
1. ✅ Short, ambiguous queries → HyDE expands them into detailed passages
2. ✅ Keyword-heavy queries → HyDE transforms into natural, semantic text
3. ✅ Conceptual questions → HyDE bridges the semantic gap better
4. ✅ Domain-specific terminology → LLM can use appropriate technical language

HyDE may struggle when:
1. ❌ Query already contains perfect keywords from documents
2. ❌ LLM's knowledge is outdated or incorrect for the domain
3. ❌ Additional latency is unacceptable (HyDE adds one extra LLM call)
""")

## 6. Advanced HyDE Variations

Let's explore more sophisticated HyDE techniques.

### Multi-Perspective HyDE

Generate multiple hypothetical documents from different perspectives to increase retrieval robustness.
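
One way to merge the per-perspective result lists is Reciprocal Rank Fusion (RRF), which combines ranks rather than raw scores, so it does not matter that distances from different hypothetical documents are not directly comparable. This is an alternative to the keep-the-best-score merge used in the cell below, not what that cell does. The sketch assumes each ranking is a list of document ids ordered best first, and uses the commonly cited constant `k=60`.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Fuse ranked id lists: each appearance contributes 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Higher fused score is better
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings produced by three different perspectives
rankings = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
]
fused = reciprocal_rank_fusion(rankings)
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Documents that rank well across several perspectives rise to the top, while a document that appears in only one list is pushed down even if that one list ranked it highly.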

In [None]:
def generate_multi_perspective_hypotheticals(query: str, num_perspectives: int = 3) -> List[str]:
    """
    Generate multiple hypothetical documents from different perspectives.
    
    Args:
        query: User's question
        num_perspectives: Number of different hypothetical documents to generate
        
    Returns:
        List of hypothetical documents
    """
    perspectives = [
        "Write a technical, detailed explanation",
        "Write a concise, beginner-friendly summary",
        "Write from a practical, implementation-focused perspective"
    ]
    
    hypotheticals = []
    
    for i in range(num_perspectives):
        perspective = perspectives[i] if i < len(perspectives) else perspectives[0]
        
        prompt = ChatPromptTemplate.from_messages([
            ("system", f"""You are an expert technical writer. {perspective} that would 
answer the following question. Be specific and authoritative."""),
            ("human", "Question: {query}\n\nPassage:")
        ])
        
        messages = prompt.format_messages(query=query)
        response = llm.invoke(messages)
        hypotheticals.append(response.content)
    
    return hypotheticals


def multi_perspective_hyde_retrieval(query: str, k: int = 3, num_perspectives: int = 3) -> List[Tuple[Document, float]]:
    """
    Perform retrieval using multiple hypothetical documents and merge results.
    
    Args:
        query: User's question
        k: Number of documents to retrieve per perspective
        num_perspectives: Number of hypothetical documents to generate
        
    Returns:
        Merged and deduplicated list of retrieved documents with scores
    """
    # Generate multiple hypothetical documents
    hypotheticals = generate_multi_perspective_hypotheticals(query, num_perspectives)
    
    # Retrieve using each hypothetical document
    all_results = {}  # Use dict to deduplicate by document ID
    
    for hyde_doc in hypotheticals:
        results = vectorstore.similarity_search_with_score(hyde_doc, k=k)
        
        for doc, score in results:
            doc_id = doc.page_content  # Use content as ID
            # Keep the best (lowest) score for each unique document
            if doc_id not in all_results or score < all_results[doc_id][1]:
                all_results[doc_id] = (doc, score)
    
    # Sort by score and return top k
    sorted_results = sorted(all_results.values(), key=lambda x: x[1])
    return sorted_results[:k]


# Test multi-perspective HyDE
print("="*80)
print("🔍 MULTI-PERSPECTIVE HyDE")
print("="*80)

test_query = "What are embedding models and why are they important?"

print(f"\n📝 Query: {test_query}\n")

# Generate hypothetical documents
hypotheticals = generate_multi_perspective_hypotheticals(test_query, 3)

print("🤖 Generated Hypothetical Documents:")
for i, hyp in enumerate(hypotheticals, 1):
    print(f"\n[Perspective {i}]")
    print(f"{'-'*70}")
    print(f"{hyp[:200]}...")
    print(f"{'-'*70}")

# Retrieve using multi-perspective approach
multi_results = multi_perspective_hyde_retrieval(test_query, k=3, num_perspectives=3)

print(f"\n📚 Final Merged Results (Top 3):")
for i, (doc, score) in enumerate(multi_results, 1):
    print(f"\n[{i}] Distance: {score:.4f}")
    print(f"Content: {doc.page_content[:150]}...")

print("\n" + "="*80)
print("💡 Multi-Perspective Advantages:")
print("="*80)
print("""
✅ Increased Retrieval Robustness: Multiple views increase chance of finding relevant docs
✅ Better Coverage: Different perspectives may retrieve complementary information
✅ Reduced False Negatives: If one perspective fails, others may succeed
❌ Higher Latency: Requires multiple LLM calls for generation
❌ More Complex: Result fusion and deduplication needed
""")

### Domain-Specific HyDE Prompt Engineering

Customize HyDE prompts for different domains to improve performance.
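
If the domain is not known in advance, a lightweight router can choose a prompt per query. The sketch below uses plain substring matching; `DOMAIN_KEYWORDS` and `pick_domain` are hypothetical, the keyword lists are placeholders, and substring matching will misfire on words that merely contain a keyword, so a production system would more likely use a classifier or metadata from the knowledge base.

```python
from typing import Dict, Tuple

# Illustrative keyword hints per domain (assumptions, tune for your own corpus)
DOMAIN_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "business": ("roi", "cost", "revenue", "strategy"),
    "tutorial": ("how do i", "step by step", "beginner", "getting started"),
    "scientific": ("study", "evidence", "experiment", "hypothesis"),
}


def pick_domain(query: str, default: str = "technical") -> str:
    """Return the first domain whose keywords appear in the query, else the default."""
    lowered = query.lower()
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return domain
    return default


for q in [
    "What is the ROI of hybrid search?",
    "How do I get started, step by step?",
    "Explain cross-encoder attention",
]:
    print(f"{pick_domain(q):<10} <- {q}")
```

The returned name is meant to be used as the key into a prompt dictionary such as the one defined in the next cell.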

In [None]:
# Domain-specific prompt templates
DOMAIN_PROMPTS = {
    "technical": """You are a senior software engineer and technical architect. 
Write a detailed technical documentation passage that answers the question below. 
Include specific terminology, architectural patterns, and implementation details.""",
    
    "scientific": """You are a research scientist writing for a peer-reviewed journal. 
Write an academic passage with precise definitions, citations to methodologies, 
and evidence-based explanations that addresses the question below.""",
    
    "business": """You are a business analyst and consultant. Write a clear, 
actionable passage that answers the question from a business perspective, 
including ROI considerations, strategic implications, and practical applications.""",
    
    "tutorial": """You are an experienced educator and technical trainer. 
Write a step-by-step, beginner-friendly passage that explains the answer 
to the question below with examples and clear explanations."""
}


def domain_aware_hyde(query: str, domain: str = "technical") -> str:
    """
    Generate a hypothetical document tailored to a specific domain.
    
    Args:
        query: User's question
        domain: Domain type ('technical', 'scientific', 'business', 'tutorial')
        
    Returns:
        Domain-specific hypothetical document
    """
    system_prompt = DOMAIN_PROMPTS.get(domain, DOMAIN_PROMPTS["technical"])
    
    prompt = ChatPromptTemplate.from_messages([
        ("system", system_prompt),
        ("human", "Question: {query}\n\nPassage:")
    ])
    
    messages = prompt.format_messages(query=query)
    response = llm.invoke(messages)
    
    return response.content


# Demonstrate domain-specific HyDE
print("="*80)
print("🎨 DOMAIN-SPECIFIC HyDE EXAMPLES")
print("="*80)

test_query = "What are the benefits of using hybrid search in RAG systems?"

domains = ["technical", "business", "tutorial"]

for domain in domains:
    print(f"\n{'='*80}")
    print(f"🏷️  Domain: {domain.upper()}")
    print(f"{'='*80}")
    
    hyde_doc = domain_aware_hyde(test_query, domain)
    print(f"\n{hyde_doc[:300]}...\n")

print("="*80)
print("💡 Domain-Specific Prompt Engineering Benefits:")
print("="*80)
print("""
✅ Better Semantic Alignment: Matches the writing style of your knowledge base
✅ Improved Terminology: Uses domain-appropriate vocabulary
✅ Enhanced Relevance: LLM generates content that better matches document style
✅ Flexibility: Can adapt to different sections of a heterogeneous knowledge base

Best Practice: Analyze your knowledge base's writing style and adapt prompts accordingly!
""")

## 7. Results Analysis and Conclusions

Let's summarize our findings and provide actionable recommendations.

In [None]:
print("="*80)
print("📊 FINAL ANALYSIS & RECOMMENDATIONS")
print("="*80)

print("""
## Key Findings

### 1. Performance Comparison

Baseline RAG:
- ✅ Fast: Direct query embedding (no extra LLM call)
- ✅ Simple: Straightforward implementation
- ❌ Query-Document Mismatch: Struggles with semantic gaps
- ❌ Keyword-Dependent: Poor on vague or conversational queries

HyDE-Enhanced RAG:
- ✅ Better Semantic Alignment: Transforms query to document-like text
- ✅ Often Closer Matches: Lower retrieval distance on many queries (verify on your own data)
- ✅ Robust to Query Variations: Handles short, vague queries better
- ❌ Additional Latency: +1 LLM call before retrieval (overhead depends on model and output length)
- ❌ Token Costs: Extra tokens for hypothetical document generation

### 2. When to Use HyDE

✅ RECOMMENDED FOR:
- Complex, conceptual queries requiring semantic understanding
- Short or ambiguous user inputs that need expansion
- Domain-specific knowledge bases with technical terminology
- Systems where accuracy is more critical than speed
- B2B applications with sophisticated users

❌ NOT RECOMMENDED FOR:
- Simple keyword-based lookups (e.g., "Find document about X")
- Real-time systems with strict latency requirements (<100ms)
- Cost-sensitive applications with high query volumes
- Knowledge bases where LLM may lack domain expertise

### 3. Optimization Strategies

🔧 Hybrid Approach:
- Use query classification to route simple queries to baseline
- Apply HyDE only for queries that a heuristic flags as vague or conceptual
- Example heuristic: query contains a quoted phrase or exact identifier → baseline, else → HyDE

🔧 Caching:
- Cache hypothetical documents for frequently asked questions
- Reduce redundant LLM calls for similar queries

🔧 Async Processing:
- Generate hypothetical document while performing baseline retrieval
- Combine results from both approaches for best coverage

### 4. Trade-off Analysis

| Dimension       | Baseline | HyDE | Winner |
|----------------|----------|------|--------|
| Speed          | ⚡⚡⚡    | ⚡⚡   | Baseline |
| Accuracy       | ⭐⭐     | ⭐⭐⭐  | HyDE |
| Cost           | 💰       | 💰💰  | Baseline |
| Simplicity     | ✅✅✅    | ✅✅   | Baseline |
| Robustness     | ⭐⭐     | ⭐⭐⭐  | HyDE |

### 5. Production Best Practices

1️⃣ Start Simple: Implement baseline first, measure performance
2️⃣ A/B Test: Compare baseline vs HyDE on representative queries
3️⃣ Monitor Metrics: Track latency, cost, and user satisfaction
4️⃣ Iterate: Fine-tune prompts based on failure analysis
5️⃣ Consider Hybrid: Use query classification for intelligent routing

### 6. Next Steps in Your RAG Journey

After mastering HyDE, explore:
- 🔄 Multi-Query Decomposition (Demo #2)
- 🔍 Hybrid Search with BM25 (Demo #3)
- 🏗️ Hierarchical Retrieval (Demo #4)
- 🎯 Cross-Encoder Re-ranking (Demo #5)
""")

## 🎓 Summary and Key Takeaways

**Congratulations!** You've successfully implemented HyDE-enhanced RAG and compared it against baseline RAG.

### What You Learned:
1. ✅ The **query-document asymmetry problem** and why it matters
2. ✅ How to implement **baseline Naive RAG** with direct query embedding
3. ✅ How to build **HyDE-enhanced RAG** with hypothetical document generation
4. ✅ Comparative evaluation methodologies for RAG systems
5. ✅ Advanced HyDE variations: multi-perspective and domain-specific approaches

### Key Insights:
- HyDE transforms retrieval from query→document to answer→answer similarity
- The technique adds latency but can noticeably improve semantic alignment
- Domain-specific prompt engineering can further enhance HyDE performance
- Multi-perspective HyDE increases robustness at the cost of complexity

### Production Considerations:
- Measure the latency-accuracy trade-off for your specific use case
- Consider hybrid approaches that route queries intelligently
- Monitor token costs and implement caching strategies
- Continuously evaluate and iterate on prompt templates
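
Routing and caching can be combined in a thin wrapper around the generation step. The sketch below is one possible shape, under stated assumptions: `looks_like_exact_lookup` is an illustrative heuristic (quoted phrases or digits suggest an exact lookup), `make_router` is a hypothetical helper, and `fake_generate` stands in for the LLM call. The router returns the text that should be embedded for retrieval.

```python
from functools import lru_cache
from typing import Callable, List


def looks_like_exact_lookup(query: str) -> bool:
    """Illustrative heuristic: quotes or digits hint at an exact-match lookup."""
    return '"' in query or any(ch.isdigit() for ch in query)


def make_router(hyde_generate: Callable[[str], str], cache_size: int = 256) -> Callable[[str], str]:
    """Build a function that returns the text to embed for a given query."""
    cached_generate = lru_cache(maxsize=cache_size)(hyde_generate)

    def route(query: str) -> str:
        normalized = " ".join(query.split())  # collapse whitespace for better cache hits
        if looks_like_exact_lookup(normalized):
            return normalized                 # baseline: embed the query itself
        return cached_generate(normalized)    # HyDE: embed a (cached) hypothetical doc

    return route


# Stand-in for the LLM call; records how often it is actually invoked
calls: List[str] = []


def fake_generate(query: str) -> str:
    calls.append(query)
    return f"A passage answering: {query}"


route = make_router(fake_generate)
first = route("Why do large chunks hurt retrieval?")
second = route("Why do large   chunks hurt retrieval?")  # same query after normalization
lookup = route('find "bge-reranker-large"')

print(f"LLM calls made: {len(calls)}")
```

An exact-string cache only helps with repeated questions; catching paraphrases would need a semantic cache keyed on query embeddings, which is a larger design decision than this sketch covers.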

---

## 📚 References and Further Reading

1. **HyDE Paper**: "Precise Zero-Shot Dense Retrieval without Relevance Labels" (Gao et al., 2022)
2. **LangChain Documentation**: https://python.langchain.com/docs/
3. **ChromaDB**: https://docs.trychroma.com/
4. **Sentence Transformers**: https://www.sbert.net/

---

**Next Demo**: Multi-Query and Sub-Query Decomposition (Demo #2)