# Context enrichment window for document retrieval

This notebook demonstrates how to use a context enrichment window technique for document retrieval in a vector database. Traditional vector search systems, such as FAISS, return isolated chunks of text based on the query. These chunks may lack surrounding context, making it difficult to fully understand the information. The aim is to improve the quality of search results by retrieving not only relevant chunks but also their neighboring chunks, providing more context for understanding the retrieved information.



In [1]:
import os
import sys
from dotenv import load_dotenv
from langchain.docstore.document import Document
from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
import pymupdf
from typing import List

# Load environment variables from a .env file
load_dotenv()

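# Access the API key (fail early with a clear message if it is missing)
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file.")
os.environ["OPENAI_API_KEY"] = api_key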

## Document preprocessing
Let's start by reading the content of a PDF file and splitting it into manageable chunks.
### Read PDF to string
Here, we read the PDF content and convert it into a string using the `pymupdf` library. This will allow us to process the text further.


In [2]:
# Define the path to the PDF
path = "Understanding_Climate_Change.pdf"

# Open the PDF document
doc = pymupdf.open(path)
content = ""

# Iterate over each page and extract text
for page_num in range(len(doc)):
    # Get the current page
    page = doc[page_num]
    # Extract the text content from the current page and append it to the content string
    content += page.get_text()

The PDF file is opened using `pymupdf.open()`, and we iterate over all pages to extract the text. The extracted text is then concatenated into one large string. This will allow us to process it further and split it into manageable chunks.

### Split text into chunks
After reading the content of the PDF, we split the text into smaller chunks so that each one can be embedded and searched on its own. We use a fixed chunk size (in characters) and let consecutive chunks overlap, so that text near a chunk boundary appears in both neighboring chunks.

In [3]:
# Split the text into chunks with overlap
chunks_size = 400
chunk_overlap = 200
chunks = []
start = 0

# Loop to split the content into chunks until the end of the text
while start < len(content):
    # Calculate the end position of the current chunk
    end = start + chunks_size
    # Extract the current chunk from the content
    chunk = content[start:end]
    # Append the chunk to the list, including its index in metadata
    chunks.append(Document(page_content=chunk, metadata={"index": len(chunks), "text": content}))
    # Move the starting index forward, ensuring overlap between chunks
    start += chunks_size - chunk_overlap  # Adjust to create overlap

We define a `chunks_size` of 400 characters for each chunk and a `chunk_overlap` of 200 characters. The loop runs as long as the `start` position is less than the length of `content`, so it ends when we reach the end of the text.
- **Extracting a chunk**:
   - `end = start + chunks_size`: This calculates the end position of the current chunk, so each chunk is 400 characters long, except possibly the last ones, which are cut off at the end of the text.
   - `chunk = content[start:end]`: This slices the text from the `start` position to the `end` position. It gives us a substring (i.e., a chunk) of the document. This chunk will be stored in a `Document` object.
- **Storing the chunk**:
   - `chunks.append(Document(page_content=chunk, metadata={"index": len(chunks), "text": content}))`: Here, we store each chunk in the `chunks` list. Each chunk is wrapped in a `Document` object, where:
     - `page_content=chunk` stores the actual content of the chunk.
     - `metadata={"index": len(chunks), "text": content}` adds additional metadata. The `index` tracks the chunk’s position in the text (useful for later retrieval). The `text` metadata holds the full text of the document, although this might be redundant in this case.
- **Update the `start` position**:
   - `start += chunks_size - chunk_overlap`: This line creates the overlap. It advances `start` by `chunks_size - chunk_overlap` characters, which is less than a full chunk. With a chunk size of 400 and an overlap of 200, `start` moves forward by 200 characters on each iteration, so the next chunk begins 200 characters before the end of the previous one and the two chunks share those 200 characters.
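
The stepping logic described above can be sketched on a tiny string, where the overlap is easy to see by eye. This is a minimal illustration, not the notebook's own code: the helper name `split_with_overlap` and the 10-character sample are assumptions made for the example, and it returns plain `(start, chunk)` tuples instead of `Document` objects.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, str]]:
    """Return (start_offset, chunk) pairs for fixed-size chunks with overlap.

    Hypothetical helper for illustration; it mirrors the while-loop above.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # how far `start` advances each step
    return [(start, text[start:start + chunk_size]) for start in range(0, len(text), stride)]


sample = "abcdefghij"  # 10 characters
for start, chunk in split_with_overlap(sample, chunk_size=4, chunk_overlap=2):
    print(start, repr(chunk))
# 0 'abcd'
# 2 'cdef'
# 4 'efgh'
# 6 'ghij'
# 8 'ij'
```

Each chunk repeats the last two characters of the previous one, and the final chunk is shorter because the text runs out. The guard on `chunk_overlap` matters: if the overlap were equal to the chunk size, `start` would never advance.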

## Vector store creation
Now, let's create a vector store where we store the text chunks as vectors. We will convert the text chunks into numerical vector representations using OpenAI's embeddings, and then store them in a FAISS vector store.

In [4]:
# Initialize the OpenAI embeddings
embeddings = OpenAIEmbeddings()

# Create a FAISS vector store from the document chunks
vectorstore = FAISS.from_documents(chunks, embeddings)

# Create a retriever for the vector store to fetch relevant documents
retriever = vectorstore.as_retriever(search_kwargs={"k": 1})

- Here, we initialize the `OpenAIEmbeddings` class, a client for OpenAI's embedding API that converts the text of each chunk into a vector.
- The `FAISS.from_documents` method takes the list of `chunks` (which contains the document text) and the initialized `embeddings` model to convert each chunk into a vector. These vectors are then stored in a FAISS vector store.
- The `as_retriever()` method converts the FAISS vector store into a retriever object. This retriever is used to search for relevant documents based on the vectors stored in the FAISS index. The `search_kwargs={"k": 1}` parameter ensures that when we search for a query, only 1 document (the most relevant one) is returned. We can adjust this number (`k`) to return more documents if needed.
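
To make the role of `k` concrete, here is a toy, pure-Python sketch of nearest-neighbor retrieval. Everything in it is an assumption for illustration: `toy_embed` is a letter-frequency stand-in for a real embedding model, and `top_k` does a brute-force Euclidean search, which may differ from the metric and index type the actual vector store uses.

```python
import math


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: normalized letter counts."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0  # avoid division by zero
    return [c / norm for c in counts]


def top_k(query: str, texts: list[str], k: int = 1) -> list[str]:
    """Return the k texts whose vectors are closest to the query vector."""
    query_vector = toy_embed(query)

    def distance(text: str) -> float:
        return math.dist(query_vector, toy_embed(text))  # Euclidean distance

    return sorted(texts, key=distance)[:k]


documents = ["aaaa", "bbbb", "cccc"]
print(top_k("aab", documents, k=1))
print(top_k("aab", documents, k=2))
```

With `k=1` only the single closest text comes back; raising `k` returns more candidates, ordered from closest to farthest.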


## Context-enriched retrieval
With the vector store in place, we can now enhance the retrieval process by not just returning a single relevant chunk, but also its surrounding context. This will help improve the comprehensiveness of the search results.

#### Define the query

In [5]:
# Define the query
query = "Explain the role of deforestation and fossil fuels in climate change."

This is the search query that will be used to retrieve the relevant chunks from the vector store.

#### Retrieve relevant chunks (the baseline chunk) using the retriever
We now use the `retriever` (which was set up with the FAISS vector store) to fetch the chunks of text that are most relevant to the input query.

In [7]:
# Retrieve the relevant chunk using the retriever
relevant_chunks = retriever.invoke(query)

The `retriever.invoke(query)` call returns the chunk(s) most relevant to the query, based on the vector search. This gives us the starting point for retrieving more context from the document. Since the retriever is set to return only one chunk (because of `k=1`), `relevant_chunks` will be a list containing the single most relevant chunk.

### Retrieve neighboring chunks with context enrichment
Now we define a function that retrieves a specific chunk from the vector store by its index.

In [8]:
from typing import Optional

def get_chunk_by_index(vectorstore, target_index: int) -> Optional[Document]:
    """
    Retrieve a chunk from the vectorstore based on its index in the metadata.
    
    Args:
    vectorstore (VectorStore): The vectorstore containing the chunks.
    target_index (int): The index of the chunk to retrieve.
    
    Returns:
    Optional[Document]: The retrieved chunk as a Document object, or None if not found.
    """
    all_docs = vectorstore.similarity_search("", k=vectorstore.index.ntotal)
    for doc in all_docs:
        if doc.metadata.get('index') == target_index:
            return doc  # Return the document if the index matches
    return None  # Return None if no chunk with the target index is found

This function retrieves a chunk by searching through all documents in the vector store and matching the index stored in the metadata. If a chunk with the given index is found, it is returned. Otherwise, the function returns `None`.
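
Scanning every document on each call is fine for a small demo, but the same lookup can also be done with a dictionary built once from the list of chunks. The following is a minimal sketch of that alternative, not the notebook's method. To keep it runnable without LangChain, it uses a hypothetical `Chunk` dataclass with the same two attributes as `Document`; the helper names are also assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for a LangChain Document (same two attributes)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_index_lookup(chunks: list[Chunk]) -> dict[int, Chunk]:
    """Map each chunk's metadata['index'] to the chunk itself (built once)."""
    return {c.metadata["index"]: c for c in chunks if "index" in c.metadata}


def get_chunk_from_lookup(lookup: dict[int, Chunk], target_index: int) -> Optional[Chunk]:
    """Return the chunk with the given index, or None if it does not exist."""
    return lookup.get(target_index)


demo_chunks = [Chunk(page_content=text, metadata={"index": i})
               for i, text in enumerate(["first", "second", "third"])]
lookup = build_index_lookup(demo_chunks)
print(get_chunk_from_lookup(lookup, 1).page_content)  # second
print(get_chunk_from_lookup(lookup, 7))               # None
```

Each lookup is then a single dictionary access, and no query has to be embedded just to list the stored documents.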

#### Iterate over the relevant chunks
For each relevant chunk retrieved from the vector store, we fetch the neighboring chunks (before and after). The neighboring chunks are then sorted by index to ensure they appear in the correct order. We concatenate the chunks, considering the overlap between them to maintain context continuity. This enriched chunk is then added to the result.

In [9]:
# Set the number of neighboring chunks to retrieve
num_neighbors = 1

# Prepare the list of enriched chunk sequences
result_sequences = []

# Iterate over the relevant chunks
for chunk in relevant_chunks:
    current_index = chunk.metadata.get('index')
    if current_index is None:
        continue
    
    # Determine the range of chunks to retrieve (before and after the relevant chunk)
    start_index = max(0, current_index - num_neighbors)
    end_index = current_index + num_neighbors + 1  # +1 because range is exclusive at the end

    # Retrieve all chunks in the range
    neighbor_chunks = []
    for i in range(start_index, end_index):
        # Retrieve the chunk by its index
        neighbor_chunk = get_chunk_by_index(vectorstore, i)
        if neighbor_chunk:
            neighbor_chunks.append(neighbor_chunk)

    # Sort the chunks by their index to ensure correct order
    neighbor_chunks.sort(key=lambda x: x.metadata.get('index', 0))

    # Concatenate the chunks, accounting for overlap
    concatenated_text = neighbor_chunks[0].page_content
    for i in range(1, len(neighbor_chunks)):
        current_chunk = neighbor_chunks[i].page_content
        overlap_start = max(0, len(concatenated_text) - chunk_overlap)
        concatenated_text = concatenated_text[:overlap_start] + current_chunk

    # Append the concatenated result to the final list
    result_sequences.append(concatenated_text)

- `num_neighbors = 1` means that we want to retrieve 1 chunk before and 1 chunk after the relevant chunk.
- In `result_sequences` we store the context-enriched sequences after processing the relevant chunks and their neighboring chunks.

We loop over each relevant chunk returned by the retriever. For each chunk:
- We retrieve its index from the metadata (`current_index = chunk.metadata.get('index')`). If the index is missing, we skip that chunk (`continue`), as we need the index to fetch neighboring chunks.
- We calculate the indices of the neighboring chunks that we want to retrieve:
    - `start_index`: We subtract the number of neighbors from the current index to get the index of the first neighboring chunk before the relevant chunk. The `max(0, ...)` ensures we don’t go below index 0.
    - `end_index`: This is one step beyond the current index plus the number of neighbors, so we can retrieve chunks that come after the relevant chunk. The `+1` ensures we include the chunk after the relevant one because Python ranges are exclusive at the end.
- Then, we fetch the neighboring chunks. For each index in the range, we call the `get_chunk_by_index` function to fetch the corresponding chunk from the vector store. If the chunk is found (i.e., it's not `None`), we add it to the `neighbor_chunks` list. Indices past the last chunk simply return `None` and are skipped.
- Once we have gathered the neighboring chunks, we sort them by their index (`x.metadata.get('index', 0)`). This ensures that the chunks are in the correct order, which is critical for maintaining the continuity of the content.
- Finally, we concatenate the chunks while accounting for the overlap. We start with the first neighboring chunk and append the others one by one. Because consecutive chunks share `chunk_overlap` characters, we first slice the last `chunk_overlap` characters off the text accumulated so far and then append the next chunk in full, so the shared text appears only once.
- After merging the neighboring chunks, we append the resulting sequence to the `result_sequences` list. This list holds all the enriched context sequences for further processing or output.
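
The window calculation and the merge step can be sketched as two small pure-Python helpers operating on plain strings. This is an illustrative variation, not the notebook's code: the names `neighbor_window` and `merge_consecutive` are assumptions, the window is additionally clamped at the upper end, and the merge keeps `position * stride` characters instead of dropping the last `chunk_overlap` characters. For full-size chunks the two rules give the same result; the stride-based rule also handles trailing chunks that are shorter than the overlap.

```python
def neighbor_window(current_index: int, num_neighbors: int, total_chunks: int) -> range:
    """Indices of the chunk and its neighbors, clamped to the valid range."""
    start = max(0, current_index - num_neighbors)
    end = min(total_chunks, current_index + num_neighbors + 1)  # exclusive end
    return range(start, end)


def merge_consecutive(chunk_texts: list[str], chunk_size: int, chunk_overlap: int) -> str:
    """Merge consecutive overlapping chunks into one continuous string."""
    if not chunk_texts:
        return ""
    stride = chunk_size - chunk_overlap
    merged = chunk_texts[0]
    for position, text in enumerate(chunk_texts[1:], start=1):
        # Chunk number `position` starts `position * stride` characters after
        # the first chunk, so keep exactly that much and append the chunk.
        merged = merged[:position * stride] + text
    return merged


text = "abcdefghij"
size, overlap = 5, 3
pieces = [text[s:s + size] for s in range(0, len(text), size - overlap)]
print(pieces)                                          # ['abcde', 'cdefg', 'efghi', 'ghij', 'ij']
window = neighbor_window(current_index=3, num_neighbors=1, total_chunks=len(pieces))
print(list(window))                                    # [2, 3, 4]
print(merge_consecutive([pieces[i] for i in window], size, overlap))  # efghij
```

The merged string `efghij` is exactly the span of the original text covered by chunks 2 to 4, with no repeated or missing characters.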

### Compare the baseline chunk with the enriched chunk

In [10]:
print("Regular retrieval:\n")
print(relevant_chunks[0].page_content)  # The content of the baseline chunk

print("\nRetrieval with context enrichment:\n")
print(result_sequences[0])  # Example of enriched context for the first relevant chunk

Regular retrieval:

ntribute 
to climate change. These forests are vital for regulating the Earth's climate and supporting 
indigenous communities and wildlife. 
Agriculture 
Agriculture contributes to climate change through methane emissions from livestock, rice 
paddies, and the use of synthetic fertilizers. Methane is a potent greenhouse gas with a much 
higher heat-trapping capability than CO2, albeit in smaller 

Retrieval with context enrichment:

n. 
Boreal Forests 
Boreal forests, found in the northern regions of North America, Europe, and Asia, also play a 
crucial role in sequestering carbon. Logging and land-use changes in these regions contribute 
to climate change. These forests are vital for regulating the Earth's climate and supporting 
indigenous communities and wildlife. 
Agriculture 
Agriculture contributes to climate change through methane emissions from livestock, rice 
paddies, and the use of synthetic fertilizers. Methane is a potent greenhouse gas with a much 
hi