# HyDe RAG: Hypothetical Document Enhanced Retrieval-Augmented Generation 🚀📄

This project implements **HyDe RAG** (Hypothetical Document Enhanced Retrieval-Augmented Generation), a RAG pipeline that uses hypothetical answers to improve retrieval and answer generation. Instead of retrieving context solely from the user query, HyDe RAG first generates one or more hypothetical answers to the query, embeds them, and uses these embeddings to find the most relevant document chunks. This can improve retrieval quality, especially for complex or ambiguous queries, because a drafted answer usually sits closer to relevant passages in embedding space than a short question does.

Inspired by the HyDe research paper: [Precise Zero-Shot Dense Retrieval without Relevance Labels](https://arxiv.org/pdf/2212.10496)

---

## Features ✨

- **Single and Multiple HyDe Modes:**  
    Generate one or several hypothetical answers for a query, and use their embeddings (averaged in the multi-hypo case) for retrieval. 🤔➡️📚
- **Flexible LLM and Embedding Integration:**  
    Uses Google Gemini (through LangChain's `ChatGoogleGenerativeAI`) as the LLM and Google GenAI (`text-embedding-004`) for embeddings. 🤖🔗
- **Semantic Scoring:**  
    Retrieved chunks are scored for semantic similarity to the hypothetical answer(s). 🧠📈
- **Chunked Document Processing:**  
    Documents are split into manageable chunks for efficient retrieval and ranking. ✂️📄
- **Easy-to-Use Functions:**  
    Includes `HydeRAG` and `HydeRAG_Multiple` for single and multi-hypothetical workflows. 🛠️

---

## Workflow 🔄

1. **Document Loading & Chunking:**  
     Documents are loaded from the `data` directory and split into chunks using `SentenceSplitter`. 📂✂️

2. **Embedding & LLM Setup:**  
     Google GenAI embeddings and Gemini LLM are configured. 🧩🤖

3. **Indexing:**  
     Chunks are indexed using a vector store for fast similarity search. 🗂️⚡

4. **Hypothetical Answer Generation:**  
     For a given query, the LLM generates one or more plausible answers (hypotheticals). 💡

5. **Embedding Hypotheticals:**  
     Each hypothetical answer is embedded. In multi-hypo mode, embeddings are averaged. 🧬➗

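6. **Retrieval:**  
     Candidate chunks are fetched from the index. In the notebook code, retrieval runs on the original query and the hypothetical embedding(s) are then used to re-rank the candidates; retrieving directly with the hypothetical embedding is a possible variation. 🎯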

7. **Scoring & Ranking:**  
     Retrieved chunks are scored (cosine similarity or semantic evaluator) and ranked. 🏅

8. **Final Answer Generation:**  
     The top-ranked chunks are provided as context to the LLM to generate the final answer. 📝
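
The steps above can be sketched end to end with toy stand-ins. In this minimal sketch, `make_toy_embedder` and `toy_generate_hypothetical` are hypothetical placeholders for the Google GenAI embedding model and the Gemini call, so the scores only illustrate the flow and say nothing about real retrieval quality.

```python
import re
import numpy as np

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def make_toy_embedder(corpus):
    """Hypothetical stand-in for an embedding model: unit-length bag-of-words
    vectors over the corpus vocabulary (words outside it are ignored)."""
    words = sorted({w for text in corpus for w in tokenize(text)})
    vocab = {w: i for i, w in enumerate(words)}

    def embed(text):
        vec = np.zeros(len(vocab))
        for w in tokenize(text):
            if w in vocab:
                vec[vocab[w]] += 1.0
        norm = np.linalg.norm(vec)
        return vec / norm if norm else vec

    return embed

def toy_generate_hypothetical(query):
    """Hypothetical stand-in for the LLM call that drafts an answer."""
    return ("Prompt engineering is the process of designing prompts "
            "that guide a language model to produce accurate outputs.")

def hyde_retrieve(query, chunks, embed, generate, top_k=2):
    hypo = generate(query)                                # step 4: draft an answer
    hypo_vec = embed(hypo)                                # step 5: embed the draft
    chunk_vecs = np.stack([embed(c) for c in chunks])
    scores = chunk_vecs @ hypo_vec                        # cosine, vectors are unit length
    order = np.argsort(-scores)[:top_k]                   # steps 6-7: retrieve and rank
    return [(float(scores[i]), chunks[i]) for i in order]

chunks = [
    "Designing prompts that guide a language model to accurate outputs is iterative.",
    "Bananas are rich in potassium.",
    "Temperature controls randomness in token sampling.",
]
embed = make_toy_embedder(chunks)
results = hyde_retrieve("what is prompt engineering ?", chunks, embed,
                        toy_generate_hypothetical)
for score, chunk in results:
    print(f"{score:.3f}  {chunk}")
```

The top-ranked chunks would then be joined into the context for the final prompt (step 8).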

---
## 📖 References

- [HyDe RAG Paper (arXiv:2212.10496)](https://arxiv.org/pdf/2212.10496)

---
## Custom Chunk Retrieval Implementation 🛠️

Rather than relying only on out-of-the-box retrieval, this project scores and ranks the retrieved chunks by hand for flexibility and transparency:

- **Document Chunking:** Documents are split into manageable chunks using a sentence-based splitter.
- **Embedding:** Both queries (or their hypothetical answers) and document chunks are embedded using Google GenAI.
- **Similarity Scoring:** Retrieved chunks are scored using either cosine similarity or semantic evaluators, allowing for fine-grained ranking.
- **Manual Ranking:** The code sorts and selects the top-k chunks based on similarity scores, ensuring only the most relevant context is used for answer generation.

This custom approach allows for experimentation with different retrieval strategies, scoring methods, and chunking techniques, making the system highly adaptable and research-friendly.
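
The manual scoring and ranking described above boils down to a cosine-similarity sort. The following is a minimal sketch with made-up 2-D vectors; `rank_by_cosine` is a hypothetical helper name, not a function from the notebook.

```python
import numpy as np

def rank_by_cosine(reference_vec, chunk_vecs, top_k=3):
    """Return (index, score) pairs for the top_k chunks most similar to reference_vec."""
    ref = np.asarray(reference_vec, dtype=float)
    mat = np.asarray(chunk_vecs, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms.
    scores = mat @ ref / (np.linalg.norm(mat, axis=1) * np.linalg.norm(ref))
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]

reference = [1.0, 0.0]                                  # e.g. a hypothetical-answer embedding
chunk_vecs = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]       # e.g. chunk embeddings
ranked = rank_by_cosine(reference, chunk_vecs, top_k=2)
print(ranked)  # chunk 2 (same direction) ranks first, then chunk 1
```

Because cosine similarity ignores vector length, chunk 2 scores 1.0 even though it is twice as long as the reference.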

---

Feel free to refer to the code cells for detailed implementation of each step!

In [None]:
# !pip install llama-index-llms-langchain
# !pip install langchain_community langchain-google-genai
# %pip install llama-index-embeddings-google-genai
import os
GOOGLE_API_KEY = ""  # set your Google API key here
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY
from langchain_google_genai import ChatGoogleGenerativeAI
llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)
# jupyter nbconvert --ClearMetadataPreprocessor.enabled=True --to notebook --inplace HyDe_RAG.ipynb

In [None]:
# !pip install llama_index
from llama_index.core import VectorStoreIndex, Settings, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.google_genai import GoogleGenAIEmbedding
documents = SimpleDirectoryReader("data").load_data()
splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(documents)
embed_model = GoogleGenAIEmbedding(
    model_name="text-embedding-004",
    embed_batch_size=100,
    api_key=GOOGLE_API_KEY,  # reuse the key defined in the first cell
)
Settings.embed_model = embed_model
Settings.llm = llm  
vector_index = VectorStoreIndex(nodes)
# Retrieve the 5 nearest chunks for each query
retriever = vector_index.as_retriever(similarity_top_k=5, embed_model=embed_model)
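
To make the chunking step concrete, here is a simplified sketch of sentence-based splitting. It is not how `SentenceSplitter` is implemented: the library's splitter works on tokens and supports chunk overlap, neither of which is modeled here. The helper name `split_into_chunks` and the character budget are assumptions for illustration.

```python
import re

def split_into_chunks(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = "HyDe drafts an answer first. The draft is embedded. Chunks near the draft are retrieved."
small_chunks = split_into_chunks(text, max_chars=50)
for chunk in small_chunks:
    print(repr(chunk))
```

A sentence is never cut in half, so a single sentence longer than the budget still becomes its own oversized chunk.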

In [None]:
def generate_hypothetical_answer(query, llm):
    prompt = f"""
You are a helpful assistant. Generate a short paragraph that might answer the user's question.

Question: {query}

Hypothetical Answer:
"""
    response = llm.invoke(prompt)
    return response.content.strip()

In [26]:
res=generate_hypothetical_answer("what is prompt engineering ?",llm)

In [27]:
print(res)

Prompt engineering is the process of designing, refining, and optimizing the inputs (prompts) given to artificial intelligence models, especially large language models (LLMs), to achieve desired outputs. It involves understanding how these models interpret and respond to text, then crafting precise instructions, examples, and context to guide their behavior, improve accuracy, and unlock specific capabilities for various tasks.


In [28]:
embedding = embed_model.get_text_embedding(res)

In [30]:
print(embedding)
print(len(embedding))

[-0.038928118, -0.021924319, -0.047387216, -0.024699317, 0.024130518, 0.047847446, 0.024508096, 0.03380397, -0.015584286, -0.0073079923, -0.04244941, 0.006492105, 0.029012477, -0.00811799, 0.023100581, -0.074430674, 0.018327028, 0.032765668, -0.1046555, 0.0075687724, 0.0008435602, -0.01599861, -0.045755155, -0.0812758, -0.017151818, 0.035411023, 0.030723568, -0.026784474, 0.0190255, -0.032942865, 0.06873405, 0.06443513, 0.044786695, -0.08529241, -0.01751432, -0.0015939691, 0.00023016837, 0.052878816, 0.050034266, -0.015983835, -0.062446117, 0.01304924, -0.049132917, 0.027779132, -0.012147965, 0.0061806957, -0.03554646, 0.037413374, -0.0369422, 0.03924568, 0.022033593, -0.03438433, -0.027625127, 0.030073874, -0.045050923, 0.0021168902, -0.028933953, 0.057786524, 0.031006405, 0.014710289, -0.0020874338, -0.047296442, -0.017321639, -0.024701672, 0.0041553415, -0.011589824, -0.033095412, -0.0635702, -0.06775403, 0.011659481, 0.013382095, 0.047499366, -0.022251086, 0.011511231, -0.026293436

In [39]:
query="what is prompt engineering ?"

In [73]:
query_engine=vector_index.as_query_engine()

In [None]:
from llama_index.core.evaluation import SemanticSimilarityEvaluator
import asyncio
# Set up evaluator
evaluator = SemanticSimilarityEvaluator(
    embed_model=embed_model,
    similarity_threshold=0.75
)
# Reference is the HyDe-generated hypothetical doc
reference = res
async def score_and_sort_chunks(reference, retrieved_nodes):
    scored = []
    for node in retrieved_nodes:
        response = node.get_content()
        result = await evaluator.aevaluate(response=response, reference=reference)
        scored.append((result.score, response, result.passing))
    scored.sort(reverse=True, key=lambda x: x[0])
    return scored
# Run the scoring and sorting
scored_chunks = await score_and_sort_chunks(reference, retriever.retrieve(query))
print("\n Top scored chunks:")
for score, content, passing in scored_chunks[:5]:
    print(f"\nScore: {score:.3f} | Passing: {passing}\nContent: {content[:200]}...")


 Top scored chunks:

Score: 0.889 | Passing: True
Content: Prompt Engineering
February 2025
6
Introduction
When thinking about a large language model input and output, a text prompt (sometimes 
accompanied by other modalities such as image prompts) is the inp...

Score: 0.836 | Passing: True
Content: Prompt Engineering
February 2025
7
When you chat with the Gemini chatbot,1 you basically write prompts, however this 
whitepaper focuses on writing prompts for the Gemini model within Vertex AI or by ...

Score: 0.812 | Passing: True
Content: Prompt Engineering
February 2025
13
top-p settings. This can occur at both low and high temperature settings, though for different 
reasons. At low temperatures, the model becomes overly deterministic...

Score: 0.778 | Passing: True
Content: Prompt Engineering
February 2025
65
We recommend creating a Google Sheet with Table 21 as a template. The advantages of 
this approach are that you have a complete record when you inevitably have to r...

Score:

In [None]:
top_chunks = [chunk[1] for chunk in scored_chunks[:3]]  # top 3 contents
context = "\n\n".join(top_chunks)
final_prompt = f"""
Use the following context to answer the user's question.

Context:
{context}

Question: {query}

Answer:
"""

response = llm.invoke(final_prompt)
print(" Final Answer:\n", response.content.strip())

💡 Final Answer:
 Prompt engineering is the process of designing high-quality prompts that guide large language models (LLMs) to produce accurate outputs. This process involves tinkering to find the best prompt, optimizing prompt length, and evaluating a prompt’s writing style and structure in relation to the task. It is an iterative process, as inadequate prompts can lead to ambiguous or inaccurate responses.


In [None]:
import numpy as np
def HydeRAG(query):
    # Relevance gate: ask the index whether the query concerns the document at all
    is_query_related_to_document = vector_index.as_query_engine().query(
        f"Is the following query: {query} related to this document ?")
    # Match only a leading "no"; a bare substring test also fires on "not", "know", ...
    if str(is_query_related_to_document).strip().lower().startswith("no"):
        return str(is_query_related_to_document)
    # Generate a fresh hypothetical answer for this query
    res = generate_hypothetical_answer(query, llm)
    print(res)
    print(" ")
    # Embed the HyDe-generated hypothetical answer
    res_embedding = embed_model.get_text_embedding(res)
    # Retrieve chunks using the original query (you can use hypo-based retrieval too)
    _nodes = retriever.retrieve(query)
    # Score each chunk by cosine similarity to the hypothetical answer
    scored_nodes = []
    for node in _nodes:
        chunk_embedding = embed_model.get_text_embedding(node.get_content())  # force embed
        cos_score = np.dot(res_embedding, chunk_embedding) / (
            np.linalg.norm(res_embedding) * np.linalg.norm(chunk_embedding))
        scored_nodes.append((cos_score, node))
    # Sort and pick top-k
    scored_nodes.sort(reverse=True, key=lambda x: x[0])
    top_k_nodes = [node for _, node in scored_nodes[:5]]
    # Final generation
    context = "\n\n".join([node.get_content() for node in top_k_nodes])
    final_prompt = f"""
Use the following context to answer the user's question.

Context:
{context}

Question: {query}

Answer:
"""

    response = llm.invoke(final_prompt)
    print(" Final Answer:\n", response.content.strip())
    return response.content.strip()
# query="What is zero-shot prompting and when should I use it?"
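
A relevance gate that parses a free-text LLM reply is sensitive to how the reply is matched: a plain substring test for "no" also matches words such as "not", "know", or "technology". The sketch below shows one stricter alternative that looks only at the first word; `looks_like_refusal` is a hypothetical helper, and it still assumes the model opens its reply with "Yes" or "No".

```python
def looks_like_refusal(answer: str) -> bool:
    """Hypothetical helper: True when the first word of the answer is 'no'."""
    words = answer.strip().lower().split()
    return bool(words) and words[0].strip(".,!:;") == "no"

# The substring test misfires on an affirmative reply that contains "no" inside a word.
print("no" in "yes, it is about technology")

print(looks_like_refusal("Yes, it is about technology"))
print(looks_like_refusal("No, the query is not related to this document."))
```

Asking the model for a constrained reply (for example a single word) would make the gate more dependable than any string heuristic.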

In [91]:
HydeRAG(query) #query= "what is prompt engineering ?"

💡 Final Answer:
 Prompt engineering is the process of designing high-quality prompts that guide Large Language Models (LLMs) to produce accurate outputs. This process involves tinkering to find the best prompt, optimizing prompt length, and evaluating a prompt’s writing style and structure in relation to the task. It is an iterative process of crafting, testing, analyzing, documenting, and refining prompts based on the model’s performance.


In [96]:
HydeRAG(" How does contextual prompting help improve model responses?")

Hypothetical Answer:
Contextual prompting significantly improves model responses by providing the AI with specific background information, examples, or constraints directly within the prompt. This additional context helps reduce ambiguity, guides the model towards the desired scope and tone, and supplies it with relevant knowledge it might not otherwise access or prioritize. Consequently, the model can generate more accurate, relevant, and coherent outputs that directly address the user's intent, moving beyond generic or potentially incorrect responses.
 
 Final Answer:
 Contextual prompting helps improve model responses by:

*   Providing specific details or background information relevant to the current conversation or task.
*   Helping the model understand the nuances of what's being asked.
*   Enabling the model to tailor its response accordingly.
*   Allowing the model to more quickly understand the request.
*   Leading to the generation of more accurate and relevant responses.


In [108]:
def generate_avg_embedding_from_multiple_hypotheticals(query, llm, embed_model, n=3):
    embeddings = []

    for i in range(n):
        prompt = f"""
You are a helpful assistant. Generate a short paragraph that might answer the user's question.

Question: {query}

Hypothetical Answer{i+1}:
"""
        response = llm.invoke(prompt)
        hypo_doc = response.content.strip()

        # Embed each hypothetical document
        emb = embed_model.get_text_embedding(hypo_doc)
        embeddings.append(emb)

    # Average all embeddings
    avg_embedding = [sum(x)/n for x in zip(*embeddings)]

    return avg_embedding
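
The averaging step can also be written with NumPy. This is a small sketch using made-up 2-D vectors; `average_embeddings` is a hypothetical name. It also shows that the mean of unit vectors is generally shorter than unit length, which does not affect the cosine ranking used in this notebook but may matter if the averaged vector is handed to a store that assumes normalized inputs.

```python
import numpy as np

def average_embeddings(embeddings, normalize=True):
    """Mean-pool equal-length vectors; optionally rescale the result to unit length."""
    mean = np.mean(np.asarray(embeddings, dtype=float), axis=0)
    if normalize:
        norm = np.linalg.norm(mean)
        if norm > 0:
            mean = mean / norm
    return mean

hypo_vecs = [[1.0, 0.0], [0.0, 1.0]]   # two unit-length toy embeddings
raw = average_embeddings(hypo_vecs, normalize=False)
unit = average_embeddings(hypo_vecs)
print(raw, np.linalg.norm(raw))    # the plain mean has length below 1
print(unit, np.linalg.norm(unit))  # the rescaled mean has length 1
```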


In [141]:
import numpy as np
def HydeRAG_Multiple(query):
    print(" ")
    # Relevance gate: ask the index whether the query concerns the document at all
    is_query_related_to_document = vector_index.as_query_engine().query(
        f"Is the following query: {query} related to this document ?")
    # Match only a leading "no"; a bare substring test also fires on "not", "know", ...
    if str(is_query_related_to_document).strip().lower().startswith("no"):
        return str(is_query_related_to_document)
    # Average the embeddings of several HyDe-generated hypothetical answers
    res_embedding = generate_avg_embedding_from_multiple_hypotheticals(query, llm, embed_model, n=3)
    # Retrieve chunks using the original query
    _nodes = retriever.retrieve(query)
    # Score each chunk by cosine similarity to the averaged hypothetical embedding
    scored_nodes = []
    for node in _nodes:
        chunk_embedding = embed_model.get_text_embedding(node.get_content())  # force embed
        cos_score = np.dot(res_embedding, chunk_embedding) / (
            np.linalg.norm(res_embedding) * np.linalg.norm(chunk_embedding))
        scored_nodes.append((cos_score, node))

    # Sort and pick top-k
    scored_nodes.sort(reverse=True, key=lambda x: x[0])
    top_k_nodes = [node for _, node in scored_nodes[:5]]

    # Final generation
    context = "\n\n".join([node.get_content() for node in top_k_nodes])
    final_prompt = f"""
Use the following context to answer the user's question.

Context:
{context}

Question: {query}

Answer:
"""

    response = llm.invoke(final_prompt)
    print(" Final Answer:\n", response.content.strip())
    return response.content.strip()


In [104]:
HydeRAG_Multiple("What is ReAct prompting and how does it combine reasoning and action?")

 
 Final Answer:
 ReAct (reason & act) prompting is a paradigm that enables Large Language Models (LLMs) to solve complex tasks by combining natural language reasoning with the use of external tools (such as search or code interpreters). This allows the LLM to perform actions like interacting with external APIs to retrieve information, which is a step towards agent modeling.

ReAct combines reasoning and acting into a thought-action loop. This process works as follows:
1.  The LLM first reasons about the problem and generates a plan of action.
2.  It then performs the actions outlined in the plan and observes the results.
3.  The LLM uses these observations to update its reasoning and generate a new plan of action.
This thought-action loop continues until the LLM reaches a solution to the problem. This approach mimics how humans operate by reasoning verbally and taking actions to gain information.


In [112]:
HydeRAG("What is ReAct prompting and how does it combine reasoning and action?")

ReAct (Reasoning and Acting) prompting is a technique that enables large language models (LLMs) to interleave verbal reasoning traces with specific actions. It combines the LLM's ability to generate internal thoughts, plans, and reflections (reasoning) with its capacity to interact with external tools or environments (action). The model first "reasons" about the problem, formulating a plan or a step, then "acts" by executing a tool (like a search engine or calculator) based on that reasoning. It then observes the tool's output and "reasons" again about the next step, creating a dynamic, iterative loop that allows for more complex problem-solving, self-correction, and improved performance on tasks requiring multi-step operations.
 
 Final Answer:
 ReAct (reason & act) prompting is a paradigm that enables Large Language Models (LLMs) to solve complex tasks. It achieves this by combining natural language reasoning with the use of external tools (such as search or code interpreters) to per

In [110]:
HydeRAG_Multiple("How should I design a prompt for complex tasks?")

 
 Final Answer:
 Even for complex tasks, the prompt itself should be designed with simplicity, clarity, and conciseness. Avoid complex language and unnecessary information.

Here's how you should design a prompt for complex tasks:

1.  **Provide Examples:** This is highlighted as the most important best practice. For complex tasks, provide one-shot or few-shot examples within the prompt. This acts as a powerful teaching tool, showcasing desired outputs or similar responses, allowing the model to learn and tailor its generation. Ensure your examples are:
    *   Relevant to the task.
    *   Diverse.
    *   High quality and well-written.
    *   Include edge cases if you need robust output for varied inputs.

2.  **Be Specific about the Output:** A concise instruction might not be enough for complex tasks. Provide specific details about the desired format, style, or content of the response. This helps the model focus on what's relevant and improves accuracy.

3.  **Use Instructions ov

In [113]:
HydeRAG("How should I design a prompt for complex tasks?")

Hypothetical Answer:
For complex tasks, design your prompt with clarity, structure, and specificity. Begin by clearly defining the ultimate goal and providing all necessary context or background information. Break the task down into smaller, manageable steps or sub-tasks, outlining the desired process or logic for each. Crucially, specify the exact output format you expect, including any constraints, and consider providing examples of what good output looks like to minimize ambiguity and guide the AI effectively.
 
 Final Answer:
 For complex tasks, the context suggests designing prompts using the following strategies:

1.  **Provide Examples:** This is highlighted as the "most important best practice." For complex tasks, providing one-shot or few-shot examples within the prompt is highly effective. These examples act as a powerful teaching tool, showcasing desired outputs and helping the model learn and tailor its generation. Ensure your examples are relevant, diverse, high-quality, w

In [114]:
HydeRAG_Multiple("How do prompts interact with the model’s internal behavior?")

 
 Final Answer:
 Based on the provided context, prompts interact with the model's internal behavior in several ways:

1.  **Input for Prediction:** A text prompt is the input the model uses to predict a specific output. The clearer the prompt text, the better the LLM can predict the next likely text.
2.  **Influence on Efficacy:** Many aspects of a prompt, including word-choice, style, tone, structure, and context, affect the model's efficacy and its ability to provide meaningful output. Inadequate prompts can lead to ambiguous or inaccurate responses.
3.  **Guidance for Generation:** LLMs are tuned to follow instructions. Prompts guide how LLMs generate text:
    *   **System prompting** sets the overall context and purpose, defining the model's fundamental capabilities and overarching purpose. It can also specify how to return the output (e.g., format, structure).
    *   **Contextual prompting** provides specific details or background information, helping the model understand nuanc

In [124]:
HydeRAG_Multiple("How can prompts help reduce hallucinations?")

 
 Final Answer:
 According to the context, prompting for a JSON format can help limit hallucinations by forcing the model to create a structure.


In [125]:
HydeRAG("How do I make my prompts more reliable?")

To make your prompts more reliable, focus on clarity and specificity: clearly define the AI's role, provide concrete examples, and specify the desired output format. Incorporate constraints and guardrails to limit scope and prevent unwanted responses. Finally, iterate and test your prompts rigorously, refining them based on the consistency and quality of the AI's outputs.
 
 Final Answer:
 Based on the provided context, here's how to make your prompts more reliable:

1.  **Be Specific about the Output:** Provide specific details in your prompt (through system or context prompting) to help the model focus on what's relevant, improving overall accuracy. A concise or generic instruction might not guide the LLM enough. (Page 56)
2.  **Use Instructions over Constraints:** Guide the model on what it *should* do or produce (instructions) rather than just what it *should not* do (constraints). (Page 56)
3.  **Make Prompt Text Clearer:** The clearer your prompt text, the better it is for the LL

In [127]:
HydeRAG_Multiple("What should I include in a system prompt?")

 
 Final Answer:
 A system prompt sets the overall context and purpose for the language model, defining its fundamental capabilities and overarching purpose. It provides additional tasks or instructions to the system on how to return the output.

You should include explicit instructions on the desired format, style, or content of the response. Examples of what to include are:

*   **How to return the output:** For instance, specifying to "Only return the label in uppercase" or to "return the output in JSON format."
*   **Specific requirements for the output:** Such as generating a code snippet compatible with a specific programming language, or returning a certain structure.
*   **Instructions for safety and toxicity control:** For example, adding a line like "You should be respectful in your answer."
*   **Specific details** to help the model focus on what's relevant and improve accuracy.


In [140]:
HydeRAG("what is meditation ?")

'No, the query "what is meditation?" is not related to this document. The document\'s content pertains to prompt engineering, including topics like JSON repair, working with schemas, and best practices for prompt attempts.'

In [129]:
HydeRAG("How do I make sure the model stays focused?")

To ensure the model stays focused, craft your prompts with clarity and specificity, outlining the exact task, desired format, and any constraints like length or tone. Providing relevant context upfront can also prevent tangents. If the model begins to stray, gently redirect it with follow-up prompts that reiterate the core objective or correct its course, reinforcing the boundaries of the task.
 
 Final Answer:
 To make sure the model stays focused, you should:

1.  **Be specific about the desired output:** Provide specific details in your prompt (through system or context prompting) to help the model focus on what's relevant and improve overall accuracy. A concise or generic instruction might not guide the LLM enough.
    *   **DO:** "Generate a 3 paragraph blog post about the top 5 video game consoles. The blog post should be informative and engaging, and it should be written in a conversational style."
    *   **DO NOT:** "Generate a blog post about video game consoles."

2.  **Use 
