# Demo #1: HyDE (Hypothetical Document Embeddings) - Query Enhancement

## Overview

This demo shows how **HyDE (Hypothetical Document Embeddings)** can improve semantic retrieval: an LLM first writes a hypothetical answer to the query, and the search runs on that answer's embedding instead of the question's.

### Core Concepts:
- **Query Enhancement**: Transforming user queries before retrieval
- **HyDE Paradigm**: Answer-to-answer similarity search (instead of question-to-answer)
- **Pre-retrieval Optimization**: Using LLM-generated context to improve semantic matching

### Why HyDE Works:
Traditional RAG embeds the user's question and searches for similar documents. However:
- Questions and answers often use different vocabulary
- Questions are typically short and lack context
- Documents contain answers, not questions

HyDE solves this by:
1. Using an LLM to generate a hypothetical answer to the question
2. Embedding this hypothetical answer (which resembles actual documents)
3. Searching for similar documents using answer-to-answer similarity
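
The vocabulary gap described above can be illustrated without any model at all. The following is a minimal sketch that uses bag-of-words counts as a stand-in for real embeddings; the three texts are invented for this illustration and do not come from the demo's knowledge base.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical texts, made up for this illustration only.
question = "What technique uses multiple random trees for prediction?"
hypothetical_answer = (
    "A random forest is an ensemble that trains many decision trees on "
    "bootstrap samples and averages their predictions."
)
stored_chunk = (
    "Random forest is an ensemble method. It trains decision trees on "
    "bootstrap samples of the data and averages their predictions."
)

chunk_vec = bag_of_words(stored_chunk)
question_score = cosine_similarity(bag_of_words(question), chunk_vec)
answer_score = cosine_similarity(bag_of_words(hypothetical_answer), chunk_vec)

print(f"question vs chunk:            {question_score:.3f}")
print(f"hypothetical answer vs chunk: {answer_score:.3f}")
```

Real embedding models capture far more than word overlap, so the gap is usually smaller in practice, but the direction of the effect is the same idea HyDE relies on.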

### Demo Structure:
1. Setup and data ingestion
2. Build baseline RAG pipeline
3. Implement HyDE enhancement
4. Comparative evaluation
5. Results visualization

## 1. Environment Setup and Dependencies

In [None]:
# Install required packages
# Run this cell only once
# !pip install llama-index llama-index-llms-azure-openai llama-index-embeddings-azure-openai python-dotenv

In [None]:
# Import required libraries
import os
from dotenv import load_dotenv
from pathlib import Path

# LlamaIndex core components
from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    Settings
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.query_engine import TransformQueryEngine
from llama_index.core.indices.query.query_transform import HyDEQueryTransform

# Azure OpenAI components
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding

print("✓ All imports successful")

## 2. Configure Azure OpenAI Connection

**Important**: Create a `.env` file in the project root with your Azure OpenAI credentials:

```
AZURE_OPENAI_API_KEY=your_api_key
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_VERSION=2024-02-15-preview
AZURE_OPENAI_DEPLOYMENT_NAME=your-gpt4-deployment
AZURE_OPENAI_EMBEDDING_DEPLOYMENT=your-embedding-deployment
```
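
When a variable is missing, it helps to know which one. The sketch below shows one possible way to report that; `find_missing_settings` and `REQUIRED_SETTINGS` are hypothetical names introduced here, and a plain dict stands in for `os.environ`. `AZURE_OPENAI_API_VERSION` is left out of the required list because the notebook supplies a default for it.

```python
from typing import Iterable, List, Mapping

# Variables the notebook cannot run without (assumed from the .env template above).
REQUIRED_SETTINGS = (
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
    "AZURE_OPENAI_EMBEDDING_DEPLOYMENT",
)


def find_missing_settings(
    env: Mapping[str, str], required: Iterable[str] = REQUIRED_SETTINGS
) -> List[str]:
    """Return the required names that are absent or blank in `env`."""
    return [name for name in required if not env.get(name, "").strip()]


# Example with a plain dict standing in for os.environ.
example_env = {
    "AZURE_OPENAI_API_KEY": "your_api_key",
    "AZURE_OPENAI_ENDPOINT": "https://your-resource.openai.azure.com/",
    "AZURE_OPENAI_DEPLOYMENT_NAME": "",  # blank counts as missing
}
print(find_missing_settings(example_env))
```

In the notebook itself the same helper could be called as `find_missing_settings(os.environ)` after `load_dotenv()`.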

In [None]:
# Load environment variables
load_dotenv()

# Azure OpenAI configuration
api_key = os.getenv("AZURE_OPENAI_API_KEY")
azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
api_version = os.getenv("AZURE_OPENAI_API_VERSION", "2024-02-15-preview")
llm_deployment = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME")
embedding_deployment = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT")

# Validate configuration
if not all([api_key, azure_endpoint, llm_deployment, embedding_deployment]):
    raise ValueError("Missing required Azure OpenAI configuration. Check your .env file.")

print("✓ Azure OpenAI configuration loaded")
print(f"  Endpoint: {azure_endpoint}")
print(f"  LLM Deployment: {llm_deployment}")
print(f"  Embedding Deployment: {embedding_deployment}")

In [None]:
# Initialize Azure OpenAI LLM
azure_llm = AzureOpenAI(
    model="gpt-4",
    deployment_name=llm_deployment,
    api_key=api_key,
    azure_endpoint=azure_endpoint,
    api_version=api_version,
    temperature=0.1,  # Low temperature for consistent responses
)

# Initialize Azure OpenAI Embedding Model
azure_embed = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name=embedding_deployment,
    api_key=api_key,
    azure_endpoint=azure_endpoint,
    api_version=api_version,
)

# Set global defaults for LlamaIndex
Settings.llm = azure_llm
Settings.embed_model = azure_embed
Settings.chunk_size = 512
Settings.chunk_overlap = 50

print("✓ Azure OpenAI models initialized")

## 3. Load and Process Documents

We'll use a small knowledge base of machine learning concepts to demonstrate HyDE.

In [None]:
# Define data directory
data_dir = Path("./data/ml_concepts")

# Load documents
documents = SimpleDirectoryReader(
    input_dir=str(data_dir),
    required_exts=['.md']
).load_data()

print(f"✓ Loaded {len(documents)} documents")
for doc in documents:
    filename = Path(doc.metadata.get('file_name', 'unknown')).stem
    print(f"  - {filename} ({len(doc.text)} characters)")

In [None]:
# Create text splitter for chunking
text_splitter = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=50
)

# Parse documents into nodes (chunks)
nodes = text_splitter.get_nodes_from_documents(documents)

print(f"✓ Created {len(nodes)} text chunks")
print("\nExample chunk:")
print(f"Text: {nodes[0].text[:200]}...")
print(f"Source: {nodes[0].metadata.get('file_name', 'unknown')}")
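
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping windows. It is not how `SentenceSplitter` works internally: the real splitter measures size in tokens and tries to keep sentences intact, while this hypothetical `sliding_window_chunks` helper counts whole words.

```python
from typing import List


def sliding_window_chunks(
    words: List[str], chunk_size: int, overlap: int
) -> List[List[str]]:
    """Split `words` into windows of `chunk_size` that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


words = [f"w{i}" for i in range(10)]
for chunk in sliding_window_chunks(words, chunk_size=4, overlap=1):
    print(chunk)
```

Each window repeats the last word of the previous one, which is the same idea as the 50-unit overlap configured above: content that straddles a boundary still appears whole in at least one chunk.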

## 4. Build Baseline RAG Pipeline

First, we'll create a standard RAG pipeline without HyDE to establish a baseline.
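
Conceptually, the baseline retriever ranks every chunk by the similarity of its embedding to the query embedding and keeps the best few. The sketch below shows that ranking step on its own, assuming cosine similarity; the three-dimensional vectors and chunk names are toy values chosen for the example, not real embeddings.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    chunk_vecs: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k chunk ids with the highest similarity to the query."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors, invented for illustration.
toy_chunks = {
    "boosting": [0.9, 0.1, 0.0],
    "bagging": [0.7, 0.6, 0.1],
    "svm": [0.0, 0.2, 0.9],
    "regularization": [0.1, 0.9, 0.2],
}
toy_query = [1.0, 0.2, 0.0]

for chunk_id, score in top_k(toy_query, toy_chunks, k=3):
    print(f"{chunk_id}: {score:.4f}")
```

A production vector store does the same ranking over many more dimensions, usually with an approximate index rather than a full scan.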

In [None]:
# Create vector index
index = VectorStoreIndex(
    nodes=nodes,
    embed_model=azure_embed
)

print("✓ Vector index created")

In [None]:
# Create baseline query engine
baseline_query_engine = index.as_query_engine(
    llm=azure_llm,
    similarity_top_k=3  # Retrieve top 3 most similar chunks
)

print("✓ Baseline query engine created")

## 5. Test Baseline RAG

Let's test the baseline with a query that requires understanding ML concepts.

In [None]:
# Define test query
test_query = "What are the main differences between ensemble methods that build trees sequentially versus in parallel?"

print(f"Test Query: {test_query}")
print("=" * 80)

In [None]:
# Execute baseline query
baseline_response = baseline_query_engine.query(test_query)

print("\n📊 BASELINE APPROACH (Standard RAG)")
print("=" * 80)
print(f"\nAnswer:\n{baseline_response.response}")
print("\n" + "=" * 80)
print("Retrieved Chunks:")
for i, node in enumerate(baseline_response.source_nodes, 1):
    print(f"\n[Chunk {i}] Score: {node.score:.4f} | Source: {node.metadata.get('file_name', 'unknown')}")
    print(f"{node.text[:300]}...")
    print("-" * 80)

## 6. Implement HyDE Query Enhancement

Now we'll enhance the query engine with HyDE. The process:
1. User query → LLM generates hypothetical answer
2. Hypothetical answer → Embedded
3. Search using answer embedding (not query embedding)
4. Retrieved chunks → Final generation
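
Steps 1 to 3 can be sketched in plain Python to show where the hypothetical answer enters the flow. This is an illustration under stated assumptions, not LlamaIndex's implementation: `hyde_retrieve`, `toy_embed`, and `canned_llm` are hypothetical names, the "LLM" returns a fixed string, the "embedding" is a bag-of-words count, and step 4 (final generation) is left out.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Tuple


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def hyde_retrieve(
    query: str,
    generate_answer: Callable[[str], str],
    embed: Callable[[str], Counter],
    chunks: Dict[str, str],
    top_k: int = 1,
) -> List[Tuple[str, float]]:
    """Rank chunks against a generated answer instead of the raw query."""
    hypothetical_answer = generate_answer(query)  # step 1: LLM writes an answer
    search_vector = embed(hypothetical_answer)    # step 2: embed that answer
    scored = [
        (chunk_id, cosine(search_vector, embed(text)))
        for chunk_id, text in chunks.items()
    ]
    # step 3: search with the answer embedding, not the query embedding
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


def canned_llm(query: str) -> str:
    """Fake LLM that always returns the same hypothetical answer."""
    return (
        "Boosting builds trees sequentially, and each tree corrects "
        "the errors of the previous trees."
    )


# Toy corpus, invented for illustration.
toy_chunks = {
    "boosting.md": "Gradient boosting builds trees sequentially; each tree "
    "corrects the errors of the previous ones.",
    "svm.md": "A support vector machine finds the maximum-margin hyperplane "
    "that separates two classes.",
}

results = hyde_retrieve(
    "How do ensembles that grow trees one after another work?",
    canned_llm,
    toy_embed,
    toy_chunks,
)
print(results[0][0])
```

Passing the generator and the embedder in as callables keeps the control flow visible and makes it easy to swap the stubs for real model calls.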

In [None]:
# Create HyDE query transformation
hyde_transform = HyDEQueryTransform(
    llm=azure_llm,
    include_original=False  # Use only the hypothetical document, not the original query
)

print("✓ HyDE transformation created")

In [None]:
# Wrap the baseline query engine with HyDE transformation
hyde_query_engine = TransformQueryEngine(
    query_engine=baseline_query_engine,
    query_transform=hyde_transform
)

print("✓ HyDE-enhanced query engine created")

## 7. Visualize HyDE Transformation

Let's see what hypothetical document HyDE generates for our query.

In [None]:
# Generate hypothetical document to see the transformation
from llama_index.core.schema import QueryBundle

query_bundle = QueryBundle(query_str=test_query)
transformed_query = hyde_transform.run(query_bundle)

print("\n🔄 HYDE TRANSFORMATION")
print("=" * 80)
print(f"\nOriginal Query:\n{test_query}")
print("\n" + "=" * 80)
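# HyDE keeps query_str as the original question; the generated text is stored in embedding_strs
print(f"\nHypothetical Document Generated by LLM:\n{transformed_query.embedding_strs[0]}")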
print("\n" + "=" * 80)

## 8. Execute HyDE-Enhanced Query

Now let's run the same query through the HyDE-enhanced pipeline.

In [None]:
# Execute HyDE query
hyde_response = hyde_query_engine.query(test_query)

print("\n🚀 HYDE-ENHANCED APPROACH")
print("=" * 80)
print(f"\nAnswer:\n{hyde_response.response}")
print("\n" + "=" * 80)
print("Retrieved Chunks:")
for i, node in enumerate(hyde_response.source_nodes, 1):
    print(f"\n[Chunk {i}] Score: {node.score:.4f} | Source: {node.metadata.get('file_name', 'unknown')}")
    print(f"{node.text[:300]}...")
    print("-" * 80)

## 9. Side-by-Side Comparison

Let's compare the retrieved chunks and answers from both approaches.
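
Before reading the full printout, it can help to reduce the two source lists to a few numbers. The sketch below is one possible summary, assuming Jaccard overlap is a reasonable measure here; `compare_sources` is a hypothetical helper and the file names are made up for the example.

```python
from typing import Dict, List, Union


def compare_sources(
    baseline: List[str], hyde: List[str]
) -> Dict[str, Union[List[str], float]]:
    """Summarize which source files the two retrieval runs share."""
    base_set, hyde_set = set(baseline), set(hyde)
    union = base_set | hyde_set
    return {
        "shared": sorted(base_set & hyde_set),
        "baseline_only": sorted(base_set - hyde_set),
        "hyde_only": sorted(hyde_set - base_set),
        # Jaccard overlap: 1.0 means identical sets, 0.0 means disjoint sets.
        "jaccard": len(base_set & hyde_set) / len(union) if union else 1.0,
    }


# Hypothetical file names, for illustration only.
summary = compare_sources(
    ["svm.md", "boosting.md", "bagging.md"],
    ["boosting.md", "bagging.md", "random_forest.md"],
)
print(summary)
```

A low overlap means HyDE changed what was retrieved; it does not say by itself which set is better, so the chunks still need to be read.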

In [None]:
# Comparison function
def compare_approaches(baseline_resp, hyde_resp):
    print("\n" + "=" * 100)
    print("📊 COMPARATIVE ANALYSIS: BASELINE vs. HYDE")
    print("=" * 100)
    
    # Compare retrieved sources
    print("\n1. RETRIEVED SOURCES COMPARISON")
    print("-" * 100)
    
    baseline_sources = [node.metadata.get('file_name', 'unknown') for node in baseline_resp.source_nodes]
    hyde_sources = [node.metadata.get('file_name', 'unknown') for node in hyde_resp.source_nodes]
    
    print(f"\nBaseline sources: {baseline_sources}")
    print(f"HyDE sources: {hyde_sources}")
    
    # Compare relevance scores
    print("\n2. RELEVANCE SCORES COMPARISON")
    print("-" * 100)
    
    print("\nBaseline Scores:")
    for i, node in enumerate(baseline_resp.source_nodes, 1):
        print(f"  Chunk {i}: {node.score:.4f}")
    
    print("\nHyDE Scores:")
    for i, node in enumerate(hyde_resp.source_nodes, 1):
        print(f"  Chunk {i}: {node.score:.4f}")
    
    # Compare answers
    print("\n3. GENERATED ANSWERS COMPARISON")
    print("-" * 100)
    
    print(f"\nBaseline Answer ({len(baseline_resp.response)} characters):\n{baseline_resp.response}")
    print(f"\n{'-' * 100}")
    print(f"\nHyDE Answer ({len(hyde_resp.response)} characters):\n{hyde_resp.response}")
    
    print("\n" + "=" * 100)

# Run comparison
compare_approaches(baseline_response, hyde_response)

## 10. Test with Additional Queries

Let's run both approaches on more queries to check whether the pattern holds.

In [None]:
# Additional test queries
additional_queries = [
    "How do you prevent a model from memorizing training data?",
    "What technique uses multiple random trees for prediction?",
    "Which algorithm finds optimal separating boundaries in high-dimensional spaces?"
]

print("\n" + "=" * 100)
print("🔬 ADDITIONAL QUERY TESTS")
print("=" * 100)

for i, query in enumerate(additional_queries, 1):
    print(f"\n\n{'=' * 100}")
    print(f"Query {i}: {query}")
    print("=" * 100)
    
    # Baseline
    baseline_resp = baseline_query_engine.query(query)
    print(f"\n📊 Baseline: {[n.metadata.get('file_name', 'unknown') for n in baseline_resp.source_nodes]}")
    print(f"Scores: {[f'{n.score:.4f}' for n in baseline_resp.source_nodes]}")
    
    # HyDE
    hyde_resp = hyde_query_engine.query(query)
    print(f"\n🚀 HyDE: {[n.metadata.get('file_name', 'unknown') for n in hyde_resp.source_nodes]}")
    print(f"Scores: {[f'{n.score:.4f}' for n in hyde_resp.source_nodes]}")

## 11. Key Takeaways and Analysis

### What We Learned:

1. **HyDE can improve semantic matching**: By generating a hypothetical answer that resembles actual documents, we bridge the vocabulary gap between questions and answers.

2. **Answer-to-answer similarity**: Traditional RAG compares a question embedding against answer-like chunks, which can be suboptimal. HyDE compares a hypothetical answer against those chunks, which tends to match better when the question and the documents use different wording.

3. **Pre-retrieval optimization**: HyDE transforms the query before the search runs, not after.

4. **Trade-offs**: HyDE adds one extra LLM call to generate the hypothetical document, increasing latency and cost. The benefit is improved retrieval quality, provided the hypothetical answer stays on topic; a badly wrong one can pull retrieval toward irrelevant chunks.
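
The latency side of this trade-off is easy to measure. The following is a minimal timing sketch; `timed` is a hypothetical helper, and the two `fake_*` functions only sleep to imitate model calls, so the printed numbers say nothing about real Azure OpenAI latency.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call `fn` and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_baseline_query(query: str) -> str:
    """Stand-in for a baseline query: search plus one generation call."""
    time.sleep(0.01)
    return f"baseline answer to: {query}"


def fake_hyde_query(query: str) -> str:
    """Stand-in for a HyDE query: one extra call before the baseline work."""
    time.sleep(0.01)  # imitates generating the hypothetical document
    return fake_baseline_query(query).replace("baseline", "hyde")


baseline_answer, baseline_seconds = timed(fake_baseline_query, "q")
hyde_answer, hyde_seconds = timed(fake_hyde_query, "q")

print(f"baseline: {baseline_seconds:.3f}s")
print(f"hyde:     {hyde_seconds:.3f}s")
```

In this notebook the same helper could wrap the real engines, for example `timed(baseline_query_engine.query, test_query)` and `timed(hyde_query_engine.query, test_query)`.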

### When to Use HyDE:
- Queries where vocabulary mismatch is a problem
- Technical domains with specialized terminology
- When retrieval quality is more important than latency
- Complex questions that require conceptual understanding

### Data Flow Visualization:
```
Standard RAG:
User Query → Embed Query → Vector Search → Retrieved Chunks → LLM Generation → Answer

HyDE RAG:
User Query → LLM Generate Hypothetical Answer → Embed Hypothetical Answer → 
Vector Search → Retrieved Chunks → LLM Generation → Answer
```

### Next Steps:
In the next demo, we'll explore **Multi-Query Decomposition** for handling complex queries that require information from multiple sources.