# 🧠 Hypothetical Prompt Embedding RAG (HyPE) - Q2Q Retrieval

Welcome to the **Hypothetical Prompt Embedding RAG** project!  
This notebook demonstrates a Retrieval-Augmented Generation (RAG) workflow built on **Hypothetical Prompt Embeddings (HyPE)**, following the paper by Vake, Vičič, and Tošić cited below.

---

## 🚀 What is HyPE?

Traditional RAG systems retrieve document chunks based on query-to-document similarity.  
**HyPE** takes a novel approach:  
- For each document chunk, it generates a set of *hypothetical questions* that the chunk could answer.
- At query time, the user’s question is embedded and matched **directly to these hypothetical questions** (Q2Q matching), rather than to the original document text.

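A user's question usually resembles another question more closely than it resembles a passage of expository text. Matching question to question therefore aims to narrow the stylistic gap between queries and document content, so that the retrieved chunks are more relevant.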
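
To make the Q2Q idea concrete, here is a minimal sketch that compares query-to-document and query-to-question similarity. It assumes a toy bag-of-words "embedding" (`bow_embed`) in place of a real embedding model, and the chunk and hypothetical questions are made up for illustration. A real model captures meaning rather than word overlap, so the contrast would be less extreme than in this toy.

```python
import math
import re
from collections import Counter


def bow_embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical chunk and the questions an LLM might generate for it.
chunk = "Agents improve through trial and error, guided by rewards and penalties."
hypothetical_questions = [
    "How does reinforcement learning work?",
    "What feedback does a reinforcement learning agent receive?",
]

query = "How does reinforcement learning work in practice?"
q_vec = bow_embed(query)

# Query-to-document: compare the query with the chunk text itself.
q2d_score = cosine(q_vec, bow_embed(chunk))
# Query-to-question: compare the query with each hypothetical question.
q2q_score = max(cosine(q_vec, bow_embed(h)) for h in hypothetical_questions)

print(f"query-to-document: {q2d_score:.2f}")
print(f"query-to-question: {q2q_score:.2f}")
```

In this toy setup the query shares no words with the chunk but overlaps heavily with one of its hypothetical questions, which is the gap HyPE is designed to close.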

---

## 🛠️ How It Works

1. **Document Chunking:**  
    Documents are split into manageable chunks.

2. **Hypothetical Question Generation:**  
    For each chunk, an LLM generates essential questions that the chunk could answer.

3. **Embedding:**  
    Hypothetical questions and user queries are embedded with the same embedding model (here, Google's `text-embedding-004`), so both live in one vector space.

4. **Q2Q Retrieval:**  
    At query time, the user’s question is matched against the pool of hypothetical questions by cosine similarity, computed either by brute force or through a FAISS index.

5. **Answer Generation:**  
    The chunks linked to the best-matching questions are passed to the LLM as context for the final answer.
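
The steps above can be sketched end to end without any external service. This is an illustration under stated assumptions, not the notebook's implementation: `build_hype_index`, `retrieve`, `toy_embed`, and `toy_questions` are hypothetical names, the "LLM" is a one-line stub, and the "embedding" is a letter-count vector.

```python
from typing import Callable, List, Sequence, Tuple

import numpy as np


def build_hype_index(
    chunks: List[str],
    generate_questions: Callable[[str], List[str]],
    embed: Callable[[str], Sequence[float]],
) -> Tuple[np.ndarray, List[int]]:
    """Steps 2-3: embed every hypothetical question and remember its chunk."""
    vectors, owners = [], []
    for chunk_id, chunk in enumerate(chunks):
        for question in generate_questions(chunk):
            vectors.append(embed(question))
            owners.append(chunk_id)
    matrix = np.asarray(vectors, dtype="float64")
    # Unit-normalize rows (assumes no all-zero vectors in this toy setting).
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix, owners


def retrieve(query, matrix, owners, embed, top_k=2) -> List[int]:
    """Step 4: rank questions by cosine similarity, return unique chunk ids."""
    q = np.asarray(embed(query), dtype="float64")
    q /= np.linalg.norm(q)
    result: List[int] = []
    for i in np.argsort(-(matrix @ q)):
        if owners[i] not in result:
            result.append(owners[i])
        if len(result) == top_k:
            break
    return result


def toy_embed(text: str) -> List[float]:
    """Hypothetical embedding: counts of the letters a-z."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def toy_questions(chunk: str) -> List[str]:
    """Hypothetical LLM stub: one question about the chunk's first word."""
    return [f"What is {chunk.split()[0]}?"]


chunks = [
    "Regression predicts a continuous target.",
    "Clustering groups similar items.",
]
matrix, owners = build_hype_index(chunks, toy_questions, toy_embed)
best = retrieve("What is clustering?", matrix, owners, toy_embed, top_k=1)
print(chunks[best[0]])
```

Step 5 would pass the returned chunks to the LLM as context. Keeping the question generator and embedder as injected callables makes it easy to swap the stubs for real models.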

---

## 📦 Features

- End-to-end HyPE pipeline implemented from scratch
- Uses Google Gemini (`gemini-2.5-flash`) for generation and Google's `text-embedding-004` for embeddings
- Supports both brute-force and FAISS-based retrieval
- Modular and extensible code

---

## 📄 Citation
``` 
Vake, Domen and Vičič, Jernej and Tošić, Aleksandar, Bridging the Question-Answer Gap in Retrieval-Augmented Generation: Hypothetical Prompt Embeddings. Available at SSRN: https://ssrn.com/abstract=5139335 or http://dx.doi.org/10.2139/ssrn.5139335
```

---

## 📖 References

- **Bridging the Question-Answer Gap in Retrieval-Augmented Generation: Hypothetical Prompt Embeddings**  
  [Research Paper Link](https://www.researchgate.net/publication/389032824_Bridging_the_Question-Answer_Gap_in_Retrieval-Augmented_Generation_Hypothetical_Prompt_Embeddings)

---

## ✨ Credits

This project is implemented from scratch, inspired by the above paper.  
All code and logic are original and tailored for educational and research purposes.

---

Happy experimenting! 🤖✨

In [None]:
'''
HyPE (Hypothetical Prompt Embeddings) is an advanced RAG enhancement technique that
precomputes hypothetical questions for each document chunk during indexing rather
than generating content at query time.
'''

In [7]:
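import os
from getpass import getpass
# Read the key from the environment, or prompt for it, rather than hard-coding it
GOOGLE_API_KEY = os.environ.get("GOOGLE_API_KEY") or getpass("Google API key: ")
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY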
from langchain_google_genai import ChatGoogleGenerativeAI
llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)

In [None]:
from llama_index.core import VectorStoreIndex, Settings, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.google_genai import GoogleGenAIEmbedding
documents = SimpleDirectoryReader("data").load_data()
splitter = SentenceSplitter(chunk_size=1000, chunk_overlap=200)
nodes = splitter.get_nodes_from_documents(documents)
embed_model = GoogleGenAIEmbedding(
    model_name="text-embedding-004",
    embed_batch_size=100,
    api_key=os.environ["GOOGLE_API_KEY"],  # set in the first cell; avoids a second hard-coded key
)
Settings.embed_model = embed_model
Settings.llm = llm
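
The `chunk_size` and `chunk_overlap` arguments control how much text neighbouring chunks share. As a rough illustration only, the sketch below uses a word-level sliding window. It is not how `SentenceSplitter` works internally, since that class measures size in tokens and prefers sentence boundaries. `sliding_chunks` is a hypothetical helper.

```python
from typing import List


def sliding_chunks(words: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split a word list into windows of chunk_size that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not words:
        return []
    step = chunk_size - overlap
    # Stop once a window has reached the end of the text.
    last_start = max(len(words) - overlap, 1)
    return [words[i:i + chunk_size] for i in range(0, last_start, step)]


words = "a b c d e f g h i j".split()
for chunk in sliding_chunks(words, chunk_size=4, overlap=2):
    print(" ".join(chunk))
```

Each window repeats the last two words of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk.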

In [None]:
def generate_hypothetical_embeddings(nodes, llm, embed_model):
    """
    For each node, generate hypothetical questions and embed them,
    associating each embedding with the original node.

    Returns:
        List of tuples: (question_embedding, original_node)
    """
    prompt = """
    Analyze the input text and generate essential questions that, when answered, capture
    the main points and core meaning of the text. The questions should be exhaustive and
    understandable without context. When possible, named entities should be referenced
    by their full name. Only answer with questions where each question should be written
    in its own line (separated by newline) with no prefix.
    """

    embedding_pairs = []

    for node in nodes:
        #  Generate questions from node text
        full_prompt = prompt + "\n\n" + node.text
        response = llm.invoke(full_prompt)

        #  Split response into individual questions
        questions = [q.strip() for q in response.content.split('\n') if q.strip()]

        #  Get embeddings for all questions in batch
        question_embeddings = embed_model.get_text_embedding_batch(questions)

        #  Associate each embedding with the original node
        for embedding in question_embeddings:
            embedding_pairs.append((embedding, node))

    return embedding_pairs
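
The prompt asks for one question per line with no prefix, but a model may still return numbered or bulleted lines. A minimal sketch of a more defensive parser follows. `parse_questions` is a hypothetical helper, not part of the notebook, and the sample response is made up.

```python
import re
from typing import List

# Leading list markers such as "1.", "2)", "-", "*", or "•".
_PREFIX = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+")


def parse_questions(raw: str) -> List[str]:
    """Split an LLM response into clean, de-duplicated question strings."""
    questions: List[str] = []
    for line in raw.splitlines():
        text = _PREFIX.sub("", line).strip()
        if text and text not in questions:
            questions.append(text)
    return questions


sample = "1. What is reinforcement learning?\n- How is reward used?\n\nHow is reward used?"
print(parse_questions(sample))
# prints ['What is reinforcement learning?', 'How is reward used?']
```

Such a helper could replace the list comprehension that splits `response.content`, which would keep stray prefixes and duplicate questions out of the embedding step.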

In [None]:
embedding_pairs = generate_hypothetical_embeddings(nodes, llm, embed_model)


In [85]:
print(embedding_pairs[54][1].text)

Reinforcement Learning
● Learn in an interactive environment by Trial-and-Error using Feedback 
(reward/penalty) for actions
○ In supervised learning there is feedback- correct / incorrect, HERE IT IS PENALTY/REWARD
○ In unsupervised learning the similarity/difference among data items found,HERE THE AIM IS 
TO LEARN POLICY / ACTION MODEL THAT MAXIMIZE REWARD


In [97]:
q_user="Is there a technique which is in between supervised and unsupervised learning ? explain it"

In [98]:
q_embed = embed_model.get_text_embedding(q_user)

In [99]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

#  E = list of (question_embedding, node)
def retrieve_top_chunks(q_embed, E, top_k=5):
    if not E:
        return []

    # Score every hypothetical question against the query in a single call
    question_matrix = np.array([emb for emb, _ in E])
    scores = cosine_similarity(np.array(q_embed).reshape(1, -1), question_matrix)[0]

    # Walk the questions from most to least similar,
    # keeping the first (best-scoring) hit for each unique node (chunk)
    seen = set()
    top_nodes = []
    for i in np.argsort(-scores, kind="stable"):
        node = E[i][1]
        if node.node_id not in seen:
            top_nodes.append((scores[i], node))
            seen.add(node.node_id)
        if len(top_nodes) >= top_k:
            break

    return top_nodes
top_nodes = retrieve_top_chunks(q_embed, embedding_pairs)

In [101]:
def generate_answer(query, top_nodes, llm):
    context = "\n\n".join([node.text for _, node in top_nodes])
    prompt = f"""Answer the following question based on the provided context.

    Question: {query}

    Context:
    {context}

    Answer:"""
    return llm.invoke(prompt)
final_answer = generate_answer(q_user, top_nodes, llm)
print(final_answer.content)

Yes, there is a technique called **Semi-supervised Learning** which is in between supervised and unsupervised learning.

**Explanation:**
Semi-supervised learning utilizes a small portion of labeled data along with a large amount of unlabeled data. It works by:
1.  Using the small portion of labeled data to learn a partial model.
2.  Using this partial model to generate "pseudo-labels" for the unlabeled data.
3.  Combining both the original labeled data and the newly pseudo-labeled data.
4.  Finally, learning a complete model from this combined dataset to make predictions for new examples.

This approach bridges the gap between supervised learning (which relies entirely on labeled data) and unsupervised learning (which uses only unlabeled data to find patterns).


In [67]:

import faiss
import numpy as np

def build_faiss_index(nodes, llm, embed_model):
    # Keep state local so that re-running this cell does not accumulate duplicates
    question_embeddings = []
    metadata = []

    for node in nodes:
        # 1. Generate questions
        prompt = """Analyze the input text and generate essential questions that, when answered, capture
        the main points and core meaning of the text. Questions should be exhaustive and written line-by-line with no prefix.\n\n""" + node.text
        response = llm.invoke(prompt)

        questions = [q.strip() for q in response.content.split("\n") if q.strip()]
        if not questions:
            continue

        # 2. Get embeddings
        embeddings = embed_model.get_text_embedding_batch(questions)
        embeddings = [np.array(e).astype("float32") for e in embeddings]

        # 3. Normalize to unit length so that L2 ranking matches cosine ranking
        for emb in embeddings:
            norm = np.linalg.norm(emb)
            if norm > 0:
                emb /= norm
            question_embeddings.append(emb)
            metadata.append((node.text, node.node_id))

    if not question_embeddings:
        raise ValueError("No hypothetical questions were generated")

    # Convert to FAISS format; size the index from the data instead of hard-coding 768
    matrix = np.array(question_embeddings, dtype="float32")
    index = faiss.IndexFlatL2(matrix.shape[1])
    index.add(matrix)
    return index, metadata

faiss_index, metadata = build_faiss_index(nodes, llm, embed_model)
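
The index above uses L2 distance, while the comments speak of cosine similarity. The two agree on ranking once every vector has unit length, because ||a - b||² = 2 - 2·cos(a, b). A small NumPy check of that identity follows, using random stand-in vectors rather than real embeddings (the sizes and seed are arbitrary assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 stored question vectors and one query, 8 dimensions each.
questions = rng.normal(size=(50, 8))
query = rng.normal(size=8)

# Normalize everything to unit length, as the indexing code does.
questions /= np.linalg.norm(questions, axis=1, keepdims=True)
query /= np.linalg.norm(query)

cosine = questions @ query                       # higher is better
sq_l2 = ((questions - query) ** 2).sum(axis=1)   # lower is better

# Identity for unit vectors: squared L2 distance = 2 - 2 * cosine similarity.
identity_holds = np.allclose(sq_l2, 2.0 - 2.0 * cosine)

# The best match is therefore the same under both metrics.
best_by_cosine = int(np.argmax(cosine))
best_by_l2 = int(np.argmin(sq_l2))
print(identity_holds, best_by_cosine == best_by_l2)
```

An inner-product index such as `faiss.IndexFlatIP` over the same normalized vectors would express the cosine intent more directly. The L2 index is equivalent for ranking purposes.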

In [None]:
def retrieve_chunks(query: str, faiss_index, metadata, embed_model, top_k=5):
    # embed the query
    q_vec = embed_model.get_text_embedding(query)
    q_vec = np.array(q_vec).astype("float32")

    # normalize for cosine similarity
    norm = np.linalg.norm(q_vec)
    if norm > 0:
        q_vec /= norm

    # Search FAISS
    _, I = faiss_index.search(np.array([q_vec]), top_k)

    # Retrieve corresponding chunks
    seen_ids = set()
    top_chunks = []
    for idx in I[0]:
        if 0 <= idx < len(metadata):  # FAISS pads missing results with -1
            chunk_text, node_id = metadata[idx]
            if node_id not in seen_ids:
                top_chunks.append(chunk_text)
                seen_ids.add(node_id)
    return top_chunks

def generate_final_answer(query: str, retrieved_chunks, llm):
    context = "\n\n".join(retrieved_chunks)
    prompt = f"""
    You are an intelligent assistant. Use the following context to answer the user's question.

    Context:
    {context}

    Question:
    {query}

    Answer:
    """
    return llm.invoke(prompt)
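
Because several hypothetical questions point to the same chunk, searching for `top_k` questions and then de-duplicating can return fewer than `top_k` chunks. One possible remedy, sketched here under assumptions, is to fetch progressively more hits until enough distinct chunks appear. `top_unique_chunks`, `search`, and `max_fetch` are hypothetical names, and the metadata below is made up.

```python
from typing import Callable, List, Sequence, Tuple


def top_unique_chunks(
    search: Callable[[int], Sequence[int]],
    metadata: List[Tuple[str, str]],
    top_k: int = 5,
    max_fetch: int = 80,
) -> List[str]:
    """Fetch more question hits until top_k distinct chunks are found.

    `search(n)` is assumed to return the indices of the n nearest
    questions, best first, with -1 marking an empty slot.
    """
    fetch = top_k
    limit = min(max_fetch, len(metadata))
    while True:
        chunks, seen = [], set()
        for idx in search(fetch):
            if idx < 0 or idx >= len(metadata):
                continue
            text, node_id = metadata[idx]
            if node_id not in seen:
                seen.add(node_id)
                chunks.append(text)
            if len(chunks) == top_k:
                return chunks
        if fetch >= limit:
            return chunks  # not enough distinct chunks available
        fetch = min(fetch * 2, limit)


# Made-up example: the three best-ranked questions all belong to chunk "a".
demo_metadata = [("chunk A", "a"), ("chunk A", "a"), ("chunk A", "a"),
                 ("chunk B", "b"), ("chunk C", "c")]
ranked = [0, 1, 2, 3, 4]
print(top_unique_chunks(lambda n: ranked[:n], demo_metadata, top_k=2))
```

With the FAISS index, `search` could be something like `lambda n: faiss_index.search(np.array([q_vec]), n)[1][0]`, since `search` returns a `(distances, indices)` pair.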


In [69]:
def answer_query(query):
    retrieved_chunks = retrieve_chunks(query, faiss_index, metadata, embed_model, top_k=5)
    answer = generate_final_answer(query, retrieved_chunks, llm)
    return answer

In [72]:
query = "Is there a technique which is in between supervised and unsupervised learning ? explain it"
print(answer_query(query).content)


Yes, there is a technique called **Semi-supervised Learning** which is in between supervised and unsupervised learning.

**Explanation:**

*   **Semi-supervised Learning** uses a small portion of **Labeled Data** and a large number of **Unlabeled Data**.
*   It then learns a model and makes a prediction for a new example.

This makes it "in between" because:
*   **Supervised Learning** primarily uses **Labeled Data**.
*   **Unsupervised Learning** primarily uses **UNLABELED Data**.
*   **Semi-supervised Learning** leverages both, combining the benefits of having some labeled examples with the abundance of unlabeled data.


In [103]:
query_2="How do machine learning algorithms learn from data? explain that"
print(answer_query(query_2).content)

Machine Learning (ML) algorithms learn by building mathematical models from sample data and identifying patterns within that data. This process allows computers to learn without being explicitly programmed for every possible scenario. By using data and algorithms, ML aims to imitate the way humans learn, gradually improving its accuracy over time.


In [105]:
query_3="How do computers learn from data ? what is this technique of learning from data called ? explain that technique"
print(answer_query(query_3).content)

Computers learn from data by using **algorithms** to imitate the way humans learn. Specifically, **Machine Learning (ML) algorithms** learn mathematical models from sample data and patterns, gradually improving their accuracy over time. This process gives computers the ability to learn without being explicitly programmed.

This technique of learning from data is called **Machine Learning**.

**Machine Learning** is a branch of Computer Science and Artificial Intelligence that focuses on enabling computers to learn from data. It involves using data and algorithms to build mathematical models that can identify patterns and make predictions or decisions. Machine Learning is generally used for tasks such as prediction, classification, and clustering. There are various types of Machine Learning, including Supervised Learning, Unsupervised Learning, Reinforcement Learning, and Semi-supervised Learning.


In [106]:
query_4="what is data science ?"
print(answer_query(query_4).content)

I apologize, but the provided context does not contain information about "data science." It focuses on defining and explaining "Machine Learning."


In [107]:
query_5="Give a breakdown of the working of reinforcement learning "
print(answer_query(query_5).content)

Reinforcement Learning works as follows:

1.  **Interactive Environment:** The learning process takes place within an interactive environment.
2.  **Trial-and-Error:** The learning agent explores this environment by performing actions through a process of trial-and-error.
3.  **Feedback (Reward/Penalty):** For each action taken, the agent receives feedback in the form of a reward or a penalty. This type of feedback is distinct from the "correct/incorrect" labels found in supervised learning.
4.  **Goal:** The primary objective is to learn a "policy" or "action model" that dictates which actions to take in different situations, with the ultimate aim of maximizing the accumulated reward over time.


In [109]:
query_6="how do machines find patterns in data ? what kind of ML technique is this ?explain with a example"
print(answer_query(query_6).content)

Machines find patterns in data by using **Machine Learning (ML) algorithms** that learn **mathematical models** from sample data. These algorithms are designed to identify relationships, structures, or regularities within the data, gradually improving their accuracy without being explicitly programmed for every specific pattern.

The kind of ML technique that focuses on finding patterns in data, especially when the data is not labeled, is **Unsupervised Learning**. A common application of Unsupervised Learning for finding patterns is **Clustering**.

**Example:**
Imagine a large dataset of customer purchasing behavior, including items bought, frequency of purchases, and total spending. A machine learning algorithm using **Clustering** (an Unsupervised Learning technique) can analyze this data to find natural groupings or patterns among customers.

For instance, it might identify:
*   A cluster of "high-value shoppers" who buy frequently and spend a lot.
*   A cluster of "bargain hunter

In [115]:
query_7="how do machines learn relationship between a Dependent variable and a set of independent variables ? which class of technique is this ? "
print(answer_query(query_7).content)

Machines learn the relationship between a Dependent variable (target) and a set of independent variables (predictors) through **Regression**. This technique falls under **Supervised Learning**.


In [117]:
query_8="how can we make machines learn by giving them reward as we give to children for doing the right things? explain that"
print(answer_query(query_8).content)

The way machines can learn by giving them rewards, similar to how children learn, is through a branch of Machine Learning called **Reinforcement Learning**.

Here's how it works:

1.  **Interactive Environment:** The machine operates within an interactive environment.
2.  **Trial-and-Error:** It learns by performing actions through a process of trial-and-error.
3.  **Feedback (Reward/Penalty):** For each action it takes, the machine receives immediate feedback in the form of a **reward** (if the action was good or moved it closer to its goal) or a **penalty** (if the action was bad or moved it away from its goal). This is explicitly stated as being different from the "correct/incorrect" feedback in supervised learning.
4.  **Learning a Policy:** The machine uses this reward/penalty feedback to gradually learn a "policy" or "action model." This model dictates which actions it should take in different situations.
5.  **Maximizing Reward:** The ultimate aim of this learning process is to 