# Hypothetical Prompt Embeddings (HyPE) 

## Overview
Based on [research by Chen et al.](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5139335), HyPE improves RAG by **precomputing hypothetical questions at index time**, not query time.

### HyPE vs HyDE
| | HyDE | HyPE |
|---|---|---|
| **When** | Query time | Index time |
| **What** | Generate hypothetical *answer* from query | Generate hypothetical *questions* from chunks |
| **Matching** | Answer ↔ Document | Question ↔ Question |
| **Cost** | Per-query LLM call | One-time LLM call during indexing |
| **Latency** | Adds latency per query | Zero runtime overhead |


### Pipeline
```
INDEX TIME:  PDF → chunks → LLM generates questions per chunk → embed questions → FAISS
                                                                    (each question maps back to its chunk)

QUERY TIME:  User query → embed query → FAISS search (question ↔ question) → return original chunks
```

In [24]:
import os
import faiss
import numpy as np
from typing import List, Dict, Tuple
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm
from dotenv import load_dotenv

load_dotenv()

True

In [25]:
os.chdir(r"C:\Users\TempAccess\Documents\Dhruv\RAG")
print(os.getcwd())

C:\Users\TempAccess\Documents\Dhruv\RAG


In [26]:
from helper_function_openai import (
    Document,
    RetrievalResult,
    OpenAIEmbedder,
    OpenAIChat,
    read_pdf,
    chunk_text,
    show_context,
)

print("Helpers imported")

Helpers imported


# Constants

In [27]:
PATH = r"data\Understanding_Climate_Change.pdf"
LANGUAGE_MODEL_NAME = "gpt-4o-mini"
EMBEDDING_MODEL_NAME = "text-embedding-3-small"
CHUNK_SIZE = 4000
CHUNK_OVERLAP = 500

print(f"PDF: {PATH}")
print(f"LLM: {LANGUAGE_MODEL_NAME}, Embeddings: {EMBEDDING_MODEL_NAME}")
print(f"Chunk size: {CHUNK_SIZE}, Overlap: {CHUNK_OVERLAP}")

PDF: data\Understanding_Climate_Change.pdf
LLM: gpt-4o-mini, Embeddings: text-embedding-3-small
Chunk size: 4000, Overlap: 500


## HyPE Vector Store

The key data structure: a FAISS index where **each entry is a question embedding**, but it maps back to the **original chunk**. One chunk → multiple question embeddings.

This is different from a standard vector store where each entry maps 1:1 to a document.

In [28]:
class HyPEVectorStore:
    def __init__(self, dimension:int):
        self.index = faiss.IndexFlatL2(dimension)
        self.documents:List[Document] = []

    def add_question_embeddings(
        self,
        chunk_doc: Document,
        question_embedding: List[List[float]]
    ):

        """
        Add multiple question embeddings that all map to the same chunk.
        """
        
        vectors = np.array(question_embedding, dtype=np.float32)
        self.index.add(vectors)
        
        for _ in question_embedding:
            self.documents.append(chunk_doc)

    def search(self, query_embedding:List[float], k:int = 3) -> List[RetrievalResult]:
        """
        Search for similar question embeddings, return the original chunks.
        Deduplicates chunks (since one chunk may match via multiple questions).
        """
        
        query_vec = np.array([query_embedding], dtype=np.float32)
        
        ## Search more than k since we need to deduplicate chunks
        search_k = min(self.index.ntotal, k * 3)
        
        distances, indices = self.index.search(query_vec, search_k)

        seen_chunks = {}

        for dist, idx in zip(distances[0], indices[0]):
            if idx == -1:
                continue

            doc = self.documents[idx]
            chunk_key = doc.content[:100]

            if chunk_key not in seen_chunks or dist < seen_chunks[chunk_key][0]:
                seen_chunks[chunk_key] = (dist, doc)

        
        ## sort by score and take top k
        sorted_results = sorted(seen_chunks.values(), key=lambda x: -x[0])[:k]
        
        return [
            RetrievalResult(document=doc, score=float(dist), rank=i+1) 
            for i, (dist, doc) in enumerate(sorted_results)
        ]


    @property
    def total_vectors(self)->int:
        return self.index.ntotal

    @property
    def unique_chunks(self)->int:
        return len({d.content[:100] for d in self.documents})

# Hypothetical Question Generation

In [29]:
llm = OpenAIChat(
    model_name=LANGUAGE_MODEL_NAME,
    temperature=0.0,
    max_tokens=5000
)

embedder = OpenAIEmbedder(
    model=EMBEDDING_MODEL_NAME,
)



In [30]:
def generate_hypothetical_prompt_embeddings(chunk_text_str: str) -> Tuple[str, List[List[float]]]:
    """
    Generate hypothetical questions for a chunk, then embed them.
    """

    messages = [
        {
            "role": "system",
            "content": (
                "You generate essential questions from text. "
                "Each question should be on one line, without numbering or prefixes."
            )
        },
        {
            "role": "user",
            "content": (
                "Analyze the input text and generate essential questions that, "
                "when answered, capture the main points of the text. "
                "Each question should be one line, without numbering or prefixes.\n\n"
                f"Text:\n{chunk_text_str}\n\nQuestions:"
            )
        }
    ]

    response = llm.chat(messages=messages)

    questions = [
        q.strip()

        for q in response.replace("\n\n","\n").split("\n")
        if q.strip() and len(q.strip()) > 10
    ]

    questions_embeddings = embedder.embed_texts(questions)

    return chunk_text_str, questions_embeddings

# Build HyPE Index (Parallel Processing)

In [31]:
def build_hype_index(
    file_path:str,
    chunk_size:int=1000,
    chunk_overlap:int=200,
    max_workers:int=5
) -> HyPEVectorStore:

    print(f"reading pdf: {file_path}")
    text = read_pdf(file_path=file_path)

    chunks = chunk_text(text=text, chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    print(f"Created {len(chunks)} chunks (size={chunk_size}, overlap={chunk_overlap})")


    ## so now lets generate hypothetical prompts for each chunk
    store = HyPEVectorStore(
        dimension=embedder.dimension
    )

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(
                generate_hypothetical_prompt_embeddings,
                chunk
            )

            for chunk in chunks
        ]
        

        for i, future in enumerate(tqdm(as_completed(futures), total=len(chunks))):
            chunk_content, question_embeddings = future.result()

            chunk_doc = Document(
                content = chunk_content,
                metadata = {
                    "source": file_path,
                    "chunk_id": i
                }
            )

            store.add_question_embeddings(
                chunk_doc,
                question_embeddings
            )

    print(f"\nHyPE index built:")
    print(f"  Unique chunks:     {store.unique_chunks}")
    print(f"  Total vectors:     {store.total_vectors}")
    print(f"  Avg questions/chunk: {store.total_vectors / store.unique_chunks:.1f}")
    
    return store

# Build The Index

In [32]:
hype_store = build_hype_index(
    file_path=PATH,
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)

hype_store

reading pdf: data\Understanding_Climate_Change.pdf
Created 19 chunks (size=4000, overlap=500)


100%|██████████| 19/19 [00:20<00:00,  1.08s/it]


HyPE index built:
  Unique chunks:     19
  Total vectors:     270
  Avg questions/chunk: 14.2





<__main__.HyPEVectorStore at 0x15b8ae41c10>

# Test Retrieval

In [33]:
def retrieve(store: HyPEVectorStore, query:str, k:int=3) -> list[RetrievalResult]:
    """
    Query The HyPE Store - embed query, search question embeddings, return chunks.
    """
    
    query_embedding = embedder.embed_text(query)

    return store.search(query_embedding, k=k)

In [34]:
test_query = "What is the main cause of climate change?"

results = retrieve(hype_store, test_query)

In [35]:
context = list({r.document.content for r in results})

show_context(context=context)


Context 1:
ts play key roles in these efforts. 
Chapter 7: The Economics of Climate Change 
Costs of Inaction 
Economic Impacts of Climate Change 
The economic costs of climate change include damage to infrastructure, reduced agricultural 
productivity, health care costs, and lost labor productivity. Extreme weather events, such as 
hurricanes and floods, can cause significant economic disruption. Investing in climate action 
now can prevent much higher costs in the future. 
Social and Environmental Costs Climate change exacerbates social inequalities, with marginalized communities often bearing 
the brunt of its impacts. Environmental costs include loss of biodiversity, ecosystem 
degradation, and decreased availability of natural resources. Addressing these issues requires 
integrated, equitable solutions. 
Benefits of Climate Action 
Economic Opportunities 
Investing in renewable energy, energy efficiency, and sustainable practices creates jobs and 
stimulates economic growth. The 

In [36]:
print(f"\nRetrieval details for: \"{test_query}\"\n")
for r in results:
    print(f"  Rank {r.rank} | Score: {r.score:.4f} | Chunk #{r.document.metadata.get('chunk_id', '?')}")
    print(f"    {r.document.content[:150]}...\n")


Retrieval details for: "What is the main cause of climate change?"

  Rank 1 | Score: 1.0315 | Chunk #6
    ts play key roles in these efforts. 
Chapter 7: The Economics of Climate Change 
Costs of Inaction 
Economic Impacts of Climate Change 
The economic c...

  Rank 2 | Score: 0.9882 | Chunk #4
    gas more accessible, but this comes with environmental and 
health concerns. 
Deforestation 
Forests act as carbon sinks, absorbing CO2 from the atmos...

  Rank 3 | Score: 0.9785 | Chunk #5
    ss. 
Public Transportation Innovations 
Investments in efficient and reliable public transportation systems can reduce the number of 
private vehicles...



In [37]:
test_queries = [
    "How does deforestation affect global warming?",
    "What are the effects of rising sea levels?",
    "What international agreements address climate change?",
    "How does methane contribute to greenhouse effect?"
]

for query in test_queries:
    results = retrieve(hype_store, query, k=2)
    print(f"\n{'='*80}")
    print(f"Query: {query}")
    print(f"{'='*80}")
    for r in results:
        print(f"  [{r.score:.4f}] {r.document.content[:150]}...")


Query: How does deforestation affect global warming?
  [1.0255] energy-efficient appliances, improving insulation, and 
developing more fuel-efficient vehicles. 
Building Efficiency 
Energy-efficient buildings use ...
  [0.5247] gas more accessible, but this comes with environmental and 
health concerns. 
Deforestation 
Forests act as carbon sinks, absorbing CO2 from the atmos...

Query: What are the effects of rising sea levels?
  [1.0719] behaviors is essential for achieving climate goals. 
Encouraging sustainable lifestyles, reducing consumption, and promoting circular economy 
practic...
  [0.8357] utions. Collaboration between governments, academia, and the private sector can drive 
innovation and bring new technologies to market. Supporting R&D...

Query: What international agreements address climate change?
  [0.7458] ents. Regulations set mandatory standards for 
emissions and energy efficiency. Market-based mechanisms, such as carbon pricing and 
emissions trading...
  [0.711

## HyPE vs Standard Retrieval

In [38]:
from helper_function_openai import FAISSVectorStore

def build_standard_index(file_path:str, chunk_size:int=1000, chunk_overlap:int=200):
    """Standard RAG"""
    text = read_pdf(file_path=file_path).replace("\t"," ")
    chunks = chunk_text(text=text, chunk_size=chunk_size, chunk_overlap=chunk_overlap)


    documents = [Document(content=c, metadata={"source":file_path, "chunk_id":i}) for i, c in enumerate(chunks)]

    documents_embeddings = embedder.embed_documents(documents=documents)

    vector_store = FAISSVectorStore(dimension=embedder.dimension)

    vector_store.add_documents(documents=documents_embeddings)
    
    return vector_store

In [39]:
print("building standard index for comparison")
standard_store = build_standard_index(file_path=PATH, chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
print(f"Standard index: {standard_store.index.ntotal} vectors")

building standard index for comparison
Standard index: 19 vectors


In [40]:
def compare_retrieval(query:str, k:int=5):
    """Compare retrieval performance of standard vs HyPE"""
    query_embedding = embedder.embed_text(query)

    hype_results = hype_store.search(query_embedding, k=k)
    std_results = standard_store.search(query_embedding, k=k)

    print("\nQuery: {query}")
    print("-"*60)
    
    print(f"\n--- HyPE Retrieval (question-question matching) ---")
    for r in hype_results:
        print(f"  [{r.score:.4f}] {r.document.content[:130]}...")
    
    print(f"\n--- Standard Retrieval (query-chunk matching) ---")
    for r in std_results:
        print(f"  [{r.score:.4f}] {r.document.content[:130]}...")

    # Overlap check
    hype_chunks = {r.document.content[:100] for r in hype_results}
    std_chunks = {r.document.content[:100] for r in std_results}
    overlap = len(hype_chunks & std_chunks)
    print(f"\n  Overlap: {overlap}/{k} chunks in common")

In [41]:
compare_retrieval("What is the main cause of climate change?")


Query: {query}
------------------------------------------------------------

--- HyPE Retrieval (question-question matching) ---
  [1.0676] behaviors is essential for achieving climate goals. 
Encouraging sustainable lifestyles, reducing consumption, and promoting circu...
  [1.0602] conservation. Integrated water resource management supports sustainable 
and equitable water use. 
Urban Resilience 
Building urba...
  [1.0433] energy-efficient appliances, improving insulation, and 
developing more fuel-efficient vehicles. 
Building Efficiency 
Energy-effi...
  [1.0361] Paris Agreement is a landmark international accord that aims to limit global warming to 
well below 2 degrees Celsius above pre-in...
  [1.0315] ts play key roles in these efforts. 
Chapter 7: The Economics of Climate Change 
Costs of Inaction 
Economic Impacts of Climate Ch...

--- Standard Retrieval (query-chunk matching) ---
  [0.6171] Understanding Climate Change 
Chapter 1: Introduction to Climate Change 
Climate ch

In [42]:

compare_retrieval("How does deforestation affect global warming?")


Query: {query}
------------------------------------------------------------

--- HyPE Retrieval (question-question matching) ---
  [1.1356] Paris Agreement is a landmark international accord that aims to limit global warming to 
well below 2 degrees Celsius above pre-in...
  [1.1355] ducating the next generation 
fosters informed and engaged citizens. 
Teacher Training 
Providing training and resources for educa...
  [1.0983] behaviors is essential for achieving climate goals. 
Encouraging sustainable lifestyles, reducing consumption, and promoting circu...
  [1.0622] conservation. Integrated water resource management supports sustainable 
and equitable water use. 
Urban Resilience 
Building urba...
  [1.0255] energy-efficient appliances, improving insulation, and 
developing more fuel-efficient vehicles. 
Building Efficiency 
Energy-effi...

--- Standard Retrieval (query-chunk matching) ---
  [0.5274] Understanding Climate Change 
Chapter 1: Introduction to Climate Change 
Climate ch

In [43]:
compare_retrieval("What are renewable energy alternatives?")


Query: {query}
------------------------------------------------------------

--- HyPE Retrieval (question-question matching) ---
  [1.0897] nable building materials. These innovations 
contribute to energy savings and lower carbon footprints. 
Social Innovation 
Behavio...
  [1.0148] ts play key roles in these efforts. 
Chapter 7: The Economics of Climate Change 
Costs of Inaction 
Economic Impacts of Climate Ch...
  [0.9341] ss. 
Public Transportation Innovations 
Investments in efficient and reliable public transportation systems can reduce the number ...
  [0.9090] cular risks. Glacial melt also impacts hydropower generation and agriculture. 
Coastal Erosion 
Rising sea levels and increased st...
  [0.8861] ieving net-zero targets. 
Sustainable Agriculture 
Innovations in sustainable agriculture can help reduce emissions, enhance food ...

--- Standard Retrieval (query-chunk matching) ---
  [0.5013] ieving net-zero targets. 
Sustainable Agriculture 
Innovations in sustainable agric