# Hypothetical document embedding (HyDE) in document retrieval

In this notebook, we implement a Hypothetical Document Embedding (HyDE) system for document retrieval. The purpose of HyDE is to transform a query question into a detailed hypothetical document, which can bridge the gap between short queries and long, complex documents in the vector space. This approach improves the retrieval of relevant documents, especially when the query is brief but the document is detailed.

Traditional retrieval systems often face a problem known as the semantic gap: a short user query may sit far from the longer documents that answer it in the vector space. HyDE addresses this by expanding the query into a more comprehensive hypothetical document that answers the question in detail. The search then uses the embedding of that hypothetical document, which tends to lie closer to the embeddings of the stored documents than the embedding of the bare query does.
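
The effect can be illustrated with a toy example. The sketch below uses word counts as a stand-in for a real embedding model, and the three texts are invented for illustration, so treat it as an intuition aid rather than a measurement of HyDE itself.

```python
import math
import re
from collections import Counter


def bow_vector(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical texts, invented for illustration only
document = ("The primary cause of recent climate change is the increase in "
            "greenhouse gases in the atmosphere from burning fossil fuels.")
query = "Why is the planet warming?"
hypothetical_answer = ("The planet is warming because burning fossil fuels releases "
                       "greenhouse gases that increase in the atmosphere.")

query_score = cosine_similarity(bow_vector(query), bow_vector(document))
hyde_score = cosine_similarity(bow_vector(hypothetical_answer), bow_vector(document))
print(f"query vs document:        {query_score:.3f}")
print(f"hypothetical vs document: {hyde_score:.3f}")
```

In this toy setup the hypothetical answer scores higher than the bare query because it shares far more vocabulary with the document. Real embedding models capture meaning rather than word overlap, but the same intuition applies.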

In [1]:
import os
from dotenv import load_dotenv

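# Import paths for the split LangChain packages
# (langchain-community, langchain-text-splitters, langchain-core, langchain-openai)
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate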

# Load environment variables from a .env file
load_dotenv()

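# Fail early with a clear message if the API key is missing
# (assigning None to os.environ would raise a TypeError instead)
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")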

### Load the PDF document and split it into chunks
We will use `PyPDFLoader` to extract text from the PDF. It reads the PDF page by page and stores the extracted text in a list of document objects, where each document contains the content of a single page.

In [2]:
# Specify the path to the PDF document
path = "Understanding_Climate_Change.pdf"

# Load PDF documents
loader = PyPDFLoader(path)
documents = loader.load()

We now split the documents into smaller chunks using the `RecursiveCharacterTextSplitter`. This will help us avoid dealing with very large chunks of text, making document retrieval faster and more accurate.

In [3]:
# Split documents into chunks with overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # Size of each chunk
    chunk_overlap=100,  # Overlap between consecutive chunks
    length_function=len
)

# Split the loaded documents into chunks
texts = text_splitter.split_documents(documents)

We split the documents into chunks of at most 500 characters, with 100 characters of overlap between consecutive chunks.
- `chunk_size` caps the length of each chunk, which keeps chunks manageable for indexing and retrieval.
- `chunk_overlap` repeats the end of one chunk at the start of the next, so context that straddles a split point is not lost.
- `length_function` tells the splitter how to measure length. With `len`, length is counted in characters. Chunks are at most `chunk_size` long, not exactly that size, because the splitter prefers to break at natural separators such as paragraph and line breaks.
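
To make the interaction between `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-width chunking with overlap. The function `split_with_overlap` is hypothetical and deliberately simpler than `RecursiveCharacterTextSplitter`, which also tries to break at separators; it only shows how the two parameters relate.

```python
def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. This is a simplified
    illustration, not the algorithm RecursiveCharacterTextSplitter uses.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1200))  # 1,200 placeholder characters
chunks = split_with_overlap(sample, chunk_size=500, chunk_overlap=100)
print([len(c) for c in chunks])  # [500, 500, 400]
```

Each window starts 400 characters after the previous one, so the last 100 characters of one chunk reappear as the first 100 characters of the next.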

#### Replace tabs with spaces
In many cases, PDFs may contain tab characters (`\t`) that were used for indentation but aren't necessary for the final processed text. We will replace these with spaces.

In [4]:
# Replace tab characters with spaces in the text content
for text in texts:
    text.page_content = text.page_content.replace('\t', ' ')  # Replace tabs with spaces

### Generate embeddings using OpenAI
Once the text is cleaned and processed, we can create embeddings for each chunk using the OpenAI API. These embeddings represent the meaning of the text as vectors in a high-dimensional space.

In [5]:
# Initialize the OpenAI embeddings model
embeddings = OpenAIEmbeddings()

We will also use FAISS to store and index the embeddings, which lets us run similarity searches over them efficiently.

In [6]:
# Create vector store using FAISS
vector_store = FAISS.from_documents(texts, embeddings)

Here, we create a FAISS vector store by:
- Generating embeddings for each chunk of text.
- Storing the embeddings in FAISS, which allows fast similarity search later.

By default, `FAISS.from_documents` builds a flat (brute-force) index, which compares the query vector against every stored vector.
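
The idea behind a flat index can be sketched in a few lines of plain Python. The `FlatIndex` class below is a hypothetical toy, not FAISS's implementation, and it assumes Euclidean distance as the similarity measure. It stores every vector and scans all of them on each query.

```python
import math


class FlatIndex:
    """Toy brute-force index: stores every vector and scans all of them per query."""

    def __init__(self) -> None:
        self.vectors: list[list[float]] = []
        self.payloads: list[str] = []

    def add(self, vector: list[float], payload: str) -> None:
        self.vectors.append(vector)
        self.payloads.append(payload)

    def search(self, query: list[float], k: int = 3) -> list[tuple[float, str]]:
        """Return the k stored items with the smallest Euclidean distance to query."""
        scored = [(math.dist(query, vec), payload)
                  for vec, payload in zip(self.vectors, self.payloads)]
        return sorted(scored, key=lambda pair: pair[0])[:k]


index = FlatIndex()
index.add([0.0, 0.0], "chunk A")
index.add([1.0, 1.0], "chunk B")
index.add([5.0, 5.0], "chunk C")

nearest = index.search([0.9, 1.2], k=2)
print([payload for _, payload in nearest])  # ['chunk B', 'chunk A']
```

A flat index gives exact results, and its search cost grows linearly with the number of stored vectors. That is usually acceptable for a single PDF like the one used here.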

#### Initialize the language model
Next, we initialize the language model that will generate a hypothetical document from a query. 

In [7]:
# Initialize the LLM for hypothetical document generation
llm = ChatOpenAI(temperature=0, model_name="gpt-4o-mini-2024-07-18", max_tokens=4000)

- **`ChatOpenAI`**: This is LangChain's wrapper around OpenAI's chat models. Here it uses `gpt-4o-mini-2024-07-18`, a pinned snapshot of GPT-4o mini.
- **`temperature=0`**: This makes the model's output largely deterministic, so the same input usually produces the same response, although exact repeatability is not guaranteed.
- **`max_tokens=4000`**: This caps the number of tokens the model can generate in a single response. Tokens are sub-word units, so the limit does not map directly to a word count.

#### Create the prompt template for generating hypothetical documents
To guide the language model in generating a relevant document, we create a prompt template that defines how the model should respond to a query. This template asks the model to generate a document that answers the query in a detailed and comprehensive way.

In [8]:
# Create a prompt template for generating the hypothetical document
hyde_prompt = PromptTemplate(
    input_variables=["query", "chunk_size"],
    template="""Given the question '{query}', generate a hypothetical document that directly answers this question. 
    The document should be detailed and in-depth. The document size should be exactly {chunk_size} characters."""
)

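- **`input_variables=["query", "chunk_size"]`**: These are the placeholders that are filled in when the prompt is built. `query` is the user’s input question, and `chunk_size` is the target length of the generated document in characters. Language models cannot count characters reliably, so treat that length as approximate.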
- **`template`**: This defines the exact structure of the prompt sent to the model.
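
As a rough illustration of what filling the template means, the sketch below substitutes the two placeholders with plain `str.format`. The `TEMPLATE` constant is a hypothetical copy of the prompt text; `PromptTemplate` adds input validation and chain integration on top of this basic substitution.

```python
# Hypothetical plain-string copy of the prompt used above
TEMPLATE = (
    "Given the question '{query}', generate a hypothetical document that directly "
    "answers this question. The document should be detailed and in-depth. "
    "The document size should be exactly {chunk_size} characters."
)

# Substitute the placeholders the same way the prompt template would
filled = TEMPLATE.format(query="What is the main cause of climate change?", chunk_size=500)
print(filled)
```

The printed string is the full text the model receives, with the question and the target length inlined.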

#### Create the LLM chain for hypothetical document generation
We create a chain that connects the prompt template to the language model. The `|` operator pipes the output of one step into the next, so the query is first formatted by the prompt template and the resulting prompt is then passed to the `gpt-4o-mini` model for document generation.

In [9]:
# Create an LLM chain for hypothetical document generation
hyde_chain = hyde_prompt | llm
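
For readers unfamiliar with the `|` syntax, the following sketch shows how pipe-style composition can be built with Python's `__or__` method. The `Step` class and both stand-in steps are hypothetical; this is not LangChain's implementation, only the underlying idea.

```python
class Step:
    """Toy composable step; `a | b` builds a step that runs a, then b."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other: "Step") -> "Step":
        # Feed this step's output into the next step
        return Step(lambda value: other.invoke(self.invoke(value)))


# Hypothetical stand-ins for the prompt template and the language model
fill_prompt = Step(lambda inputs: f"Answer in {inputs['chunk_size']} characters: {inputs['query']}")
fake_llm = Step(lambda prompt: f"[model output for: {prompt}]")

toy_chain = fill_prompt | fake_llm
result = toy_chain.invoke({"query": "What causes tides?", "chunk_size": 500})
print(result)  # [model output for: Answer in 500 characters: What causes tides?]
```

Because each composed step is itself a `Step`, further stages such as an output parser could be appended with another `|`.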

### Generate the hypothetical document
Now, we use the prompt chain to generate the hypothetical document. We will take the query, pass it to the model, and receive a detailed hypothetical document in return.

In [10]:
# Define the input variables for the prompt
input_variables = {"query": "What is the main cause of climate change?", "chunk_size": 500}

# Generate the hypothetical document using the model
hypothetical_doc = hyde_chain.invoke(input_variables).content

- **`input_variables`**: The `query` is the user’s input question, and `chunk_size` is the length of the document we want to generate.
- **`invoke(input_variables)`**: This runs the chain: the inputs fill the prompt template, the resulting prompt is sent to the model, and `.content` extracts the text of the model's reply as the hypothetical document.

### Perform similarity search in the vector store
With the hypothetical document generated, we now use it to search for similar documents in our FAISS vector store. This step retrieves documents from the original PDF that are most similar to the hypothetical document.

In [12]:
# Perform similarity search in the vector store using the hypothetical document
similar_docs = vector_store.similarity_search(hypothetical_doc, k=3)

# Extract the content of the retrieved documents
docs_content = [doc.page_content for doc in similar_docs]

- **`similarity_search(hypothetical_doc, k=3)`**: This embeds the hypothetical document and returns the 3 stored chunks whose embeddings are most similar to it.
- **`docs_content`**: This stores the text content of those 3 retrieved chunks.
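
The generation and search steps can be wrapped into one reusable function. The sketch below is one possible shape for such a helper; `hyde_retrieve`, `fake_generate`, and `overlap_search` are hypothetical names, and the stubs stand in for `hyde_chain.invoke(...).content` and `vector_store.similarity_search` so the example runs without an API key.

```python
from typing import Callable, Sequence


def hyde_retrieve(
    query: str,
    generate: Callable[[str], str],
    search: Callable[[str, int], Sequence[str]],
    k: int = 3,
) -> list[str]:
    """Generate a hypothetical answer for query, then search with that answer."""
    hypothetical = generate(query)
    return list(search(hypothetical, k))


# Invented mini-corpus, for illustration only
corpus = [
    "Greenhouse gases from burning fossil fuels trap heat in the atmosphere.",
    "Coral reefs host a wide variety of marine species.",
    "Deforestation reduces the number of trees that absorb carbon dioxide.",
]


def words(text: str) -> set[str]:
    return set(text.lower().replace(".", "").split())


def fake_generate(query: str) -> str:
    # Stand-in for hyde_chain.invoke(...).content
    return "Burning fossil fuels releases greenhouse gases that trap heat."


def overlap_search(text: str, k: int) -> list[str]:
    # Stand-in for vector_store.similarity_search: rank by shared words
    ranked = sorted(corpus, key=lambda doc: len(words(text) & words(doc)), reverse=True)
    return ranked[:k]


top = hyde_retrieve("What is the main cause of climate change?", fake_generate, overlap_search, k=1)
print(top)
```

Passing the generator and the search function as arguments keeps the HyDE logic independent of any particular model or vector store.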

Finally, we print the hypothetical document and the retrieved documents to see how well the system performed.

In [14]:
# Print the hypothetical document and the retrieved documents
print("Hypothetical Document:\n")
print(hypothetical_doc + "\n")

print("Retrieved Documents:\n")
for content in docs_content:
    print(content)
    print("\n")

Hypothetical Document:

**The Main Cause of Climate Change**

Climate change primarily results from human activities, particularly the burning of fossil fuels such as coal, oil, and natural gas. This process releases significant amounts of carbon dioxide (CO2) and other greenhouse gases into the atmosphere, enhancing the greenhouse effect. Deforestation further exacerbates the issue by reducing the number of trees that can absorb CO2. Additionally, industrial processes, agriculture, and waste management contribute to emissions. Collectively, these factors disrupt the Earth's climate systems, leading to global warming and associated environmental impacts.

Retrieved Documents:

predict future trends. The evidence overwhelmingly shows that recent changes are primarily 
driven by human activities, particularly the emission of greenhouse gases. 
Chapter 2: Causes of Climate Change 
Greenhouse Gases 
The primary cause of recent climate change is the increase in greenhouse gases in the 
atmo

By expanding the query into a detailed hypothetical document, we improve the chances of retrieving more relevant and accurate results.