# Demo #1: HyDE (Hypothetical Document Embeddings) - Query Enhancement

## Overview

This demo shows **HyDE (Hypothetical Document Embeddings)**, a query enhancement technique that can improve semantic matching between queries and documents.

### Core Concept

Traditional RAG systems embed the user's query directly and search for similar document embeddings. However, queries are typically short questions while documents are informative passages. This asymmetry can lead to suboptimal retrieval.

**HyDE addresses this by:**
1. Using an LLM to generate a hypothetical answer document from the query
2. Embedding this hypothetical document (which has similar characteristics to real documents)
3. Searching for documents similar to this hypothetical answer

This creates an "answer-to-answer" similarity paradigm instead of "question-to-answer".
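
To make the idea concrete before touching any real services, here is a minimal, self-contained sketch. Everything in it is an assumption for illustration: `embed` is a toy bag-of-words stand-in for an embedding model, `fake_llm` is a hard-coded stand-in for the LLM call, and the two-document corpus is deliberately contrived so that the question shares surface words with the wrong document. It illustrates the mechanism only and says nothing about how much HyDE helps on real data.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def fake_llm(query: str) -> str:
    """Hypothetical stand-in for the LLM that drafts an answer-like passage."""
    return (
        "Ensemble methods combine several models, such as bagging and boosting, "
        "to reduce variance and improve accuracy."
    )


def retrieve(text: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to `text`."""
    q = embed(text)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


docs = [
    "Bagging and boosting combine several weak models to reduce variance and improve accuracy.",
    "What are the key questions to ask when choosing a machine learning course?",
]
query = "What are the key advantages of ensemble methods in machine learning?"

# Baseline: embed the question itself.
baseline_hit = retrieve(query, docs)[0]
# HyDE: embed the hypothetical answer instead of the question.
hyde_hit = retrieve(fake_llm(query), docs)[0]

print("Baseline:", baseline_hit)
print("HyDE:    ", hyde_hit)
```

In this toy setup the question matches the question-shaped document, while the hypothetical answer matches the answer-shaped one. Real embedding models are far less sensitive to surface wording, so the gap in practice is usually smaller.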

### Key Benefits
- Better semantic alignment between query and document embeddings
- Improved retrieval precision, especially for complex queries
- Works without requiring any fine-tuning or additional data

### Citation
- **Paper**: "Precise Zero-Shot Dense Retrieval without Relevance Labels" (HyDE original paper)
  - Link: https://hf.co/papers/2212.10496
- **Paper**: "ARAGOG: Advanced RAG Output Grading" - Evaluates HyDE effectiveness
  - Link: https://hf.co/papers/2404.01037

## 1. Setup and Installation

In [None]:
# Install required packages
# Run this cell if packages are not already installed
# !pip install llama-index llama-index-llms-azure-openai llama-index-embeddings-azure-openai python-dotenv pandas

In [None]:
import os
from dotenv import load_dotenv
from pathlib import Path

# llama-index core imports
from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    Settings,
    StorageContext,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine

# Azure OpenAI imports
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding

print("✓ All imports successful")

## 2. Configure Azure OpenAI

Make sure you have a `.env` file in the project root with:
```
AZURE_OPENAI_API_KEY=your_key
AZURE_OPENAI_ENDPOINT=your_endpoint
AZURE_OPENAI_API_VERSION=2024-02-15-preview
AZURE_OPENAI_DEPLOYMENT_NAME=your_gpt4_deployment
AZURE_OPENAI_EMBEDDING_DEPLOYMENT=your_embedding_deployment
```
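
A missing variable makes `os.getenv` return `None`, which tends to surface later as a confusing client error. As a hedged sketch, a small check like the following can report which of the variables listed above are unset. The helper name `missing_env_vars` is hypothetical and not part of any library.

```python
import os
from typing import Mapping

# The variable names listed in the .env example above.
REQUIRED_VARS = (
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
    "AZURE_OPENAI_EMBEDDING_DEPLOYMENT",
)


def missing_env_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or blank (hypothetical helper)."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Example with an explicit mapping so the check is easy to try in isolation.
example_env = {"AZURE_OPENAI_API_KEY": "your_key", "AZURE_OPENAI_ENDPOINT": ""}
print(missing_env_vars(example_env))
```

In the notebook, calling `missing_env_vars()` with no argument right after `load_dotenv()` would check the real environment.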

In [None]:
# Load environment variables
load_dotenv()

# Configure Azure OpenAI LLM
azure_llm = AzureOpenAI(
    model="gpt-4",
    deployment_name=os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
    temperature=0.1,
)

# Configure Azure OpenAI Embeddings
azure_embed = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name=os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
)

# Set global defaults
Settings.llm = azure_llm
Settings.embed_model = azure_embed
Settings.chunk_size = 512
Settings.chunk_overlap = 50

print("✓ Azure OpenAI configured successfully")
print(f"  LLM: {os.getenv('AZURE_OPENAI_DEPLOYMENT_NAME')}")
print(f"  Embeddings: {os.getenv('AZURE_OPENAI_EMBEDDING_DEPLOYMENT')}")

## 3. Load and Process Documents

We'll use machine learning concept documents as our knowledge base. This focused domain allows us to clearly demonstrate HyDE's effectiveness.

In [None]:
# Define data path - using existing ML concepts data
data_path = "../RAG_v2/data/ml_concepts"

# Fail fast with a clear message if the path is missing, before attempting to load
if not Path(data_path).exists():
    raise FileNotFoundError(
        f"Data path not found: {data_path}. Adjust data_path to point at your data directory."
    )
print(f"✓ Data path found: {data_path}")

# Load documents
reader = SimpleDirectoryReader(data_path)
documents = reader.load_data()

print(f"\nLoaded {len(documents)} documents:")
for i, doc in enumerate(documents, 1):
    print(f"  {i}. {Path(doc.metadata['file_path']).name} ({len(doc.text)} characters)")

## 4. Chunk Documents

We use sentence-based chunking with moderate chunk size (512 tokens) and minimal overlap (50 tokens).
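
To show what chunk size and overlap mean, here is a simplified sketch of a sliding window over a token list. It is an illustration under stated assumptions, not how `SentenceSplitter` works internally: the real splitter also tries to respect sentence boundaries and counts tokens with a tokenizer, whereas the hypothetical `window_chunks` below just slices a pre-split list.

```python
def window_chunks(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Ten tokens, windows of 4 with 1 token shared between neighbours.
tokens = [f"t{i}" for i in range(10)]
chunks = window_chunks(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

The overlap means the token at the end of one chunk reappears at the start of the next, which reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.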

In [None]:
# Initialize sentence splitter
splitter = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=50,
)

# Parse documents into nodes (chunks)
nodes = splitter.get_nodes_from_documents(documents)

print(f"✓ Created {len(nodes)} chunks from {len(documents)} documents")
print(f"\nExample chunk:")
print(f"  Length: {len(nodes[0].text)} characters")
print(f"  Content preview: {nodes[0].text[:200]}...")

## 5. Build Vector Index

Create an in-memory vector store index using Azure OpenAI embeddings.

In [None]:
# Create vector store index
index = VectorStoreIndex(nodes, embed_model=azure_embed)

print("✓ Vector index created successfully")
print(f"  Total nodes indexed: {len(nodes)}")

## 6. Create Baseline Query Engine

First, we'll create a standard query engine to establish a baseline for comparison.

In [None]:
# Create baseline query engine
baseline_query_engine = index.as_query_engine(
    similarity_top_k=3,
    llm=azure_llm,
)

print("✓ Baseline query engine created")
print("  Configuration:")
print("    - Retrieves top 3 most similar chunks")
print("    - Uses direct query embedding")

## 7. Test Baseline Retrieval

Let's test the baseline system with a technical query.

In [None]:
# Define test query
test_query = "What are the key advantages and disadvantages of using ensemble methods in machine learning?"

print("Test Query:")
print(f"  {test_query}")
print("\n" + "="*80)

# Query baseline engine
baseline_response = baseline_query_engine.query(test_query)

print("\n🔍 BASELINE RETRIEVAL")
print("="*80)
print(f"\nGenerated Answer:\n{baseline_response.response}")

# Display retrieved source nodes
print("\n" + "-"*80)
print("Retrieved Chunks (Top 3):")
print("-"*80)
for i, node in enumerate(baseline_response.source_nodes, 1):
    print(f"\nChunk {i} (Score: {node.score:.4f}):")
    print(f"Source: {Path(node.metadata.get('file_path', 'Unknown')).name}")
    print(f"Content: {node.text[:300]}...")
    print("-"*80)

## 8. Implement HyDE Query Enhancement

Now we'll create the HyDE-enhanced query engine. The key innovation is generating a hypothetical answer document before retrieval.

In [None]:
# Create HyDE query transformation
hyde_transform = HyDEQueryTransform(
    llm=azure_llm,
    include_original=False,  # Only use the hypothetical document, not the original query
)

# Wrap baseline query engine with HyDE transformation
hyde_query_engine = TransformQueryEngine(
    baseline_query_engine,
    query_transform=hyde_transform,
)

print("✓ HyDE query engine created")
print("  Enhancement: Query → LLM generates hypothetical answer → Embed hypothetical answer → Retrieve")

## 9. Visualize HyDE Process

Before running the full query, let's manually see what hypothetical document HyDE generates.

In [None]:
# Manually run the HyDE transformation to see the hypothetical document
from llama_index.core.schema import QueryBundle

# Create query bundle
query_bundle = QueryBundle(query_str=test_query)

# Generate hypothetical document
transformed_query = hyde_transform.run(query_bundle)

print("🔬 HYDE TRANSFORMATION PROCESS")
print("="*80)
print(f"\nOriginal Query:\n{test_query}")
print("\n" + "-"*80)
print(f"\nHypothetical Document Generated by LLM:")
print("-"*80)
# query_str still holds the original query; the generated text lives in embedding_strs
print(transformed_query.embedding_strs[0])
print("\n" + "="*80)
print("\n💡 Key Insight:")
print("   The hypothetical document is more similar in structure to the actual documents")
print("   in our knowledge base, leading to better semantic matching.")

## 10. Test HyDE-Enhanced Retrieval

Now let's run the same query through the HyDE-enhanced engine.

In [None]:
# Query HyDE-enhanced engine
hyde_response = hyde_query_engine.query(test_query)

print("\n🎯 HYDE-ENHANCED RETRIEVAL")
print("="*80)
print(f"\nGenerated Answer:\n{hyde_response.response}")

# Display retrieved source nodes
print("\n" + "-"*80)
print("Retrieved Chunks (Top 3):")
print("-"*80)
for i, node in enumerate(hyde_response.source_nodes, 1):
    print(f"\nChunk {i} (Score: {node.score:.4f}):")
    print(f"Source: {Path(node.metadata.get('file_path', 'Unknown')).name}")
    print(f"Content: {node.text[:300]}...")
    print("-"*80)

## 11. Side-by-Side Comparison

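Let's create a clear comparison between baseline and HyDE approaches. Keep in mind that the two sets of similarity scores are computed against different query vectors (the raw query versus the hypothetical document), so the informative comparison is which chunks are retrieved, not the raw score values.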
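
One simple way to summarize how much the two retrievers agree is to compare the sets of retrieved chunk IDs. The sketch below is a hypothetical helper (`retrieval_overlap` is not part of LlamaIndex) that works on plain lists of IDs, shown here with made-up IDs.

```python
def retrieval_overlap(baseline_ids: list[str], hyde_ids: list[str]) -> dict:
    """Summarize agreement between two retrieved ID lists (hypothetical helper)."""
    base, hyde = set(baseline_ids), set(hyde_ids)
    union = base | hyde
    return {
        "shared": sorted(base & hyde),
        "only_baseline": sorted(base - hyde),
        "only_hyde": sorted(hyde - base),
        # Jaccard similarity: 1.0 means identical sets, 0.0 means disjoint sets.
        "jaccard": len(base & hyde) / len(union) if union else 1.0,
    }


# Made-up chunk IDs for illustration.
result = retrieval_overlap(["c1", "c2", "c3"], ["c2", "c3", "c7"])
print(result)
```

In this notebook the ID lists could presumably be built from the responses, for example `[n.node.node_id for n in baseline_response.source_nodes]`.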

In [None]:
import pandas as pd

# Create comparison dataframe
comparison_data = []

# zip() stops at the shorter list, so this is safe if either engine returns fewer than 3 nodes
for i, (baseline_node, hyde_node) in enumerate(
    zip(baseline_response.source_nodes[:3], hyde_response.source_nodes[:3])
):
    
    comparison_data.append({
        "Rank": i + 1,
        "Baseline Source": Path(baseline_node.metadata.get('file_path', 'Unknown')).name,
        "Baseline Score": f"{baseline_node.score:.4f}",
        "HyDE Source": Path(hyde_node.metadata.get('file_path', 'Unknown')).name,
        "HyDE Score": f"{hyde_node.score:.4f}",
    })

comparison_df = pd.DataFrame(comparison_data)

print("\n📊 RETRIEVAL COMPARISON")
print("="*80)
print("\nRetrieved Documents by Rank:")
print(comparison_df.to_string(index=False))

print("\n" + "="*80)
print("\nAnswer Length Comparison (a rough signal, not a quality metric):")
print("-"*80)
print(f"\nBaseline Answer Length: {len(baseline_response.response)} characters")
print(f"HyDE Answer Length: {len(hyde_response.response)} characters")

## 12. Additional Test Cases

Let's test with a few more queries to see how often HyDE changes the top retrieved document.

In [None]:
# Define additional test queries
test_queries = [
    "How does the kernel trick work in support vector machines?",
    "Explain the difference between supervised and unsupervised learning with examples",
    "What is backpropagation and why is it important for neural networks?",
]

print("\n🔬 ADDITIONAL TEST CASES")
print("="*80)

for i, query in enumerate(test_queries, 1):
    print(f"\nTest Case {i}: {query}")
    print("-"*80)
    
    # Get responses
    baseline = baseline_query_engine.query(query)
    hyde = hyde_query_engine.query(query)
    
    # Compare top retrieved document
    baseline_top_source = Path(baseline.source_nodes[0].metadata.get('file_path', 'Unknown')).name
    hyde_top_source = Path(hyde.source_nodes[0].metadata.get('file_path', 'Unknown')).name
    
    print(f"  Baseline Top Source: {baseline_top_source} (Score: {baseline.source_nodes[0].score:.4f})")
    print(f"  HyDE Top Source: {hyde_top_source} (Score: {hyde.source_nodes[0].score:.4f})")
    
    if baseline_top_source != hyde_top_source:
        print("  ⚡ HyDE retrieved a different document!")
    else:
        print("  ✓ Both methods retrieved the same top document")

print("\n" + "="*80)

## 13. Data Flow Visualization

Let's create a visual representation of the HyDE workflow.

In [None]:
print("\n📈 HYDE DATA FLOW DIAGRAM")
print("="*80)
print("""
BASELINE RAG PIPELINE:
┌─────────────────┐
│  User Query     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Embed Query    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Vector Search  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Retrieved Chunks│
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ LLM Generation  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Final Answer   │
└─────────────────┘

HYDE-ENHANCED PIPELINE:
┌─────────────────┐
│  User Query     │
└────────┬────────┘
         │
         ▼
┌─────────────────────────────┐
│  LLM Generates              │
│  Hypothetical Answer Doc    │ ← KEY INNOVATION
└────────┬────────────────────┘
         │
         ▼
┌─────────────────────────────┐
│  Embed Hypothetical Doc     │
└────────┬────────────────────┘
         │
         ▼
┌─────────────────┐
│  Vector Search  │ ← Answer-to-Answer Similarity
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Retrieved Chunks│
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ LLM Generation  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Final Answer   │
└─────────────────┘
""")
print("="*80)

## 14. Key Takeaways and Insights

### When to Use HyDE:
- **Complex technical queries** where users ask questions in one style but documents are written in another
- **Domain-specific applications** where terminology matters
- **When query-document asymmetry** is causing poor retrieval

### Trade-offs:
- **Pros:**
  - Can improve semantic matching between queries and documents
  - No training or fine-tuning required
  - Works with any embedding model
  
- **Cons:**
  - Adds one extra LLM call (increased latency and cost)
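  - Generated hypothetical document might be factually incorrect (it is used only for retrieval, so it is often still semantically useful, but a badly off-topic draft can steer retrieval in the wrong direction)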
  - Less effective for very simple, keyword-based queries

### Performance Considerations:
- The hypothetical document generation adds one LLM round trip, typically ~1-2 seconds of latency depending on the model and deployment
- This is often worth it when retrieval quality is the bottleneck
- Hypothetical documents can be cached for frequently repeated queries
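
As a sketch of the caching idea, the standard library's `functools.lru_cache` can memoize the generation step. The function `generate_hypothetical_doc` below is a hypothetical stand-in for the LLM call (it only counts invocations), and normalizing case and whitespace is one possible design choice, not a requirement.

```python
from functools import lru_cache

# Counts how many times the (stand-in) LLM is actually invoked.
calls = {"count": 0}


def generate_hypothetical_doc(query: str) -> str:
    """Hypothetical stand-in for the LLM call that drafts the answer passage."""
    calls["count"] += 1
    return f"A passage answering: {query}"


@lru_cache(maxsize=1024)
def cached_hypothetical_doc(normalized_query: str) -> str:
    """Memoized wrapper: repeated queries skip the LLM call."""
    return generate_hypothetical_doc(normalized_query)


def hypothetical_doc_for(query: str) -> str:
    # Normalize case and whitespace so trivially different spellings share one entry.
    return cached_hypothetical_doc(" ".join(query.lower().split()))


first = hypothetical_doc_for("What is backpropagation?")
second = hypothetical_doc_for("  what is   Backpropagation? ")
print(calls["count"])
```

An in-process cache like this is lost when the kernel restarts; a persistent store would be needed to share cached documents across sessions.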

### Real-World Applications:
- Technical documentation search
- Medical literature retrieval
- Legal document search
- Scientific paper discovery
- Code search and software documentation

## 15. Summary

In this demo, we've successfully implemented and compared:

1. **Baseline RAG**: Direct query embedding → vector search
2. **HyDE-Enhanced RAG**: Query → LLM generates hypothetical answer → embed hypothetical answer → vector search

The key insight is that by generating a hypothetical answer document, we transform the retrieval problem from "question-to-answer" matching into "answer-to-answer" matching, which is more semantically aligned and often produces better results.

### Next Steps:
- Experiment with different types of queries
- Try HyDE with different domains and document types
- Combine HyDE with other advanced techniques (coming in subsequent demos)
- Measure performance metrics systematically
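
For the last point, two common retrieval metrics are hit rate (did any relevant chunk appear in the top k?) and mean reciprocal rank (how high was the first relevant chunk?). The sketch below assumes you have labeled relevant chunk IDs per query, which this notebook does not provide; the IDs and the helper name `hit_rate_and_mrr` are made up for illustration.

```python
def hit_rate_and_mrr(
    results: list[list[str]], relevant: list[set[str]]
) -> tuple[float, float]:
    """Compute hit rate and mean reciprocal rank over a set of queries.

    results[i] is the ranked list of retrieved IDs for query i;
    relevant[i] is the set of IDs judged relevant for query i.
    """
    if not results or len(results) != len(relevant):
        raise ValueError("results and relevant must be non-empty and the same length")
    hits, reciprocal_rank_sum = 0, 0.0
    for ranked, gold in zip(results, relevant):
        # 1-based rank of the first relevant ID, or None if nothing relevant was retrieved.
        rank = next((i for i, cid in enumerate(ranked, 1) if cid in gold), None)
        if rank is not None:
            hits += 1
            reciprocal_rank_sum += 1.0 / rank
    n = len(results)
    return hits / n, reciprocal_rank_sum / n


# Made-up example: three queries, top-3 results each.
results = [["c1", "c2", "c3"], ["c4", "c5", "c6"], ["c7", "c8", "c9"]]
relevant = [{"c1"}, {"c6"}, {"c0"}]
hit_rate, mrr = hit_rate_and_mrr(results, relevant)
print(f"hit rate = {hit_rate:.3f}, MRR = {mrr:.3f}")
```

Running the same labeled queries through both engines and comparing these two numbers gives a more systematic view than inspecting individual results.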

---

**References:**
- Original HyDE Paper: https://hf.co/papers/2212.10496
- ARAGOG Evaluation: https://hf.co/papers/2404.01037